malloy@ittral.UUCP (William P. Malloy) (06/21/85)
We ran into a rather odd little problem here a few days ago, and would like someone else's opinion.

SYSTEM description:
Sun 2/170 running Sun release 1.2 (4.2bsd binary)
4 Meg memory
1 Xylogics 450 SMD disk controller
1 Eagle (Fujitsu M 2351A) disk drive
1 SKY FPA
1 3COM Ethernet board
1 Systech Terminal Mux (16 ports)
In short, a small 16 user computer with virtual memory and a big disk.

Problem Description:
vi dumps core, repeatedly and reliably. I.e. it doesn't work.

How it started:
Someone here had just ignored the old saying about not trying to fix that which is not broken, and had just rebooted all our Sun 2/170's (in order to force fsck to check the disks) when the problem started (vi core dumping).

NOTE the following:
1) Binary hadn't been modified at all since pulled off the Sun release tape.
2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two binaries and noted they were EXACTLY the same.
3) Tried running the vi binary from the other Sun, noted it works FINE.
4) Tried copying the vi binary to another file, works FINE.
5) Get clever. Notice vi has 6 links, make 5 links to the other Sun's vi and run it. Works FINE.
6) Get cleverer still. Notice vi has sticky bit turned on, so turn on the sticky bit of the other vi, try that. Works FINE.
7) Rebooted system, problem goes away. <************* NOTE *************

Needless to say we were ANNOYED. The only thing worse than a bug is a non-repeatable bug. And the only thing worse than that is a repeatable, but incomprehensible one.

The only thing I can think of is that vi (perhaps because of the sticky bit) had been loaded into memory and retained there by UNIX, and some location in memory where vi had been put was BAD. Sun is well known for having memory problems. However, we had no memory errors in our /usr/adm/messages file (at least none recently). Anybody out there want to guess at another explanation? It's not too important so long as it doesn't repeat itself, but it is ODD.
BTW, to people who have Suns: in section 8 of the manual Sun has a standalone command called ``imemtest'' which is supposed to run a memory testing routine. The man page says it's interactive and self-explanatory. Well, it doesn't work for me. Anyone ever get this to work (or is it just some nonsense from Sun that is best deleted)?
--
Address: William P. Malloy, ITT Telecom, B & CC Engineering Group, Raleigh NC
{ihnp4!mcnc, burl, ncsu, decvax!ittvax}!ittral!malloy
reno@bunker.UUCP (Jim Reno) (06/26/85)
Oddly, we recently had a similar problem on a VAX, also with 'vi'. Since the sticky bit is on, a copy of the text is being maintained in the swap area (even when nobody is executing it). Later invocations result in the text being read from the swap area, not the binary file. Hence if the image in swap has a bad block, it will bomb, even if the original binary is still good. (Our situation was caused by a hardware problem in accessing the swap area). Of course, this just changes the problem to 'why did the swap image get corrupted?'
tim@cithep.UucP (Tim Smith ) (06/27/85)
The problem lies, I believe, with the sticky bit. You mention that vi has the sticky bit on. Assuming that 4.2bsd treats this bit the same way that UNIX does, this means that there will be a copy of the text of vi on the swapping device (or maybe the paging device). Assume that this copy is bad. I think all your symptoms are explained then:

> 1) Binary hadn't been modified at all since pulled off the Sun release tape.
> 2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two
> binaries and noted they were EXACTLY the same.

Fine. cmp opens /bin/vi, and gets the data from the filesystem, which has not been trashed.

> 3) Tried running the vi binary from the other Sun, noted it works FINE.
> 4) Tried copying the vi binary to another file, works FINE.

To be expected, since cp gets the data from the filesystem.

> 5) Get clever. Notice vi has 6 links, make 5 links to the other Sun's vi and
> run it. Works FINE.
> 6) Get cleverer still. Notice vi has sticky bit turned on, so turn on the
> sticky bit of the other vi, try that. Works FINE.

All this is being done to the new vi. The old one is still sitting wounded on your swapping or paging area(s).

> 7) Rebooted system, problem goes away. <************* NOTE *************

Rebooting makes the system forget the bad copy on the swapping area.
--
Tim Smith
ihnp4!{wlbr!callan,cithep}!tim
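The explanation above can be checked against the seven observations with a toy model. This is only an illustrative sketch, not kernel code: the class `ToyKernel`, its method names, and the file names are all made up for the example. The single assumption it encodes is that a sticky program is executed from a text image saved in swap if one exists, while ordinary reads always go to the filesystem.

```python
class ToyKernel:
    """Toy model (hypothetical): sticky text runs from a saved swap image."""

    def __init__(self):
        self.files = {}      # inode number -> bytes stored on the filesystem
        self.names = {}      # path -> inode number (hard links share an inode)
        self.sticky = set()  # inode numbers with the sticky bit set
        self.swap_text = {}  # inode number -> text image retained in swap

    def create(self, path, data, sticky=False):
        inode = len(self.files) + 1
        self.files[inode] = data
        self.names[path] = inode
        if sticky:
            self.sticky.add(inode)

    def link(self, old, new):
        # A hard link is just another name for the same inode.
        self.names[new] = self.names[old]

    def read(self, path):
        # What cmp and cp see: always the filesystem copy.
        return self.files[self.names[path]]

    def exec_text(self, path):
        # What exec sees: the swap image if one was retained.
        inode = self.names[path]
        if inode in self.swap_text:
            return self.swap_text[inode]
        text = self.files[inode]
        if inode in self.sticky:
            self.swap_text[inode] = text  # keep a copy for later runs
        return text

    def reboot(self):
        # Rebooting forgets every retained swap image.
        self.swap_text.clear()


if __name__ == "__main__":
    k = ToyKernel()
    k.create("vi", b"good text", sticky=True)
    k.exec_text("vi")                          # first run saves the swap image
    k.swap_text[k.names["vi"]] = b"BAD! text"  # simulate a damaged swap copy
    k.create("vi.other", k.read("vi"), sticky=True)
    print(k.read("vi") == k.read("vi.other"))  # cmp: files are identical
    print(k.exec_text("vi"))                   # the old vi runs the bad image
    print(k.exec_text("vi.other"))             # the new vi runs clean text
    k.reboot()
    print(k.exec_text("vi"))                   # clean again after reboot
```

In this model, new links to the old vi would still hit the bad image, because they share its inode; links to the other Sun's copy do not.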
chris@umcp-cs.UUCP (Chris Torek) (06/27/85)
(This is not an attempt at an explanation.)

The sticky bit doesn't keep things in core; it keeps them around in swap space once they've made it out there. (They will hang around in the buffer cache for a short while, but at 200K (for a 2M Sun) that isn't usually very long.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@maryland
thomas@utah-gr.UUCP (Spencer W. Thomas) (06/27/85)
I would bet on a bad spot in the swap area (or memory). As Chris points out, the sticky bit means "keep a copy of this program in swap space, even if nobody is running it". If the "running" copy of the program got corrupted, and then swapped out, you would keep on using the corrupted copy. A process is loaded into the swap area by first reading it into memory, then writing it out to swap, so if you had a bad memory location, it would be possible to corrupt the swap copy.

Things you could try if it happens again:
1. Turn off the sticky bit and run vi a couple of times. I think the system will throw away the copy in the swap area, so you should be getting a clean copy.
2. Write a program that uses ptrace() to compare a "running" copy of vi to the binary (the running copy should get the bad swap version).
--
=Spencer ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)
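The two suggestions above can be sketched roughly as follows. This is a hedged illustration only: it does not call ptrace(), and it assumes the text bytes of the running process have already been captured somehow and are passed in as `running`. The function names and the 1024-byte default page size are placeholders, not values taken from the Sun or from 4.2BSD.

```python
import stat
from typing import List


def mode_without_sticky(mode: int) -> int:
    """Return the permission bits with the sticky bit cleared (suggestion 1)."""
    return mode & ~stat.S_ISVTX


def differing_pages(reference: bytes, running: bytes,
                    page_size: int = 1024) -> List[int]:
    """List the page numbers where the running text differs from the binary.

    `reference` is the text segment as read from the file, `running` is the
    same region as seen by the process (suggestion 2). A length mismatch
    shows up as a difference in the affected pages.
    """
    longest = max(len(reference), len(running))
    bad = []
    for start in range(0, longest, page_size):
        end = start + page_size
        if reference[start:end] != running[start:end]:
            bad.append(start // page_size)
    return bad
```

Reporting page numbers rather than byte offsets makes it easier to tell whether the damage lines up with a single bad block in swap or a single bad page frame in memory.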
terryl@tekcrl.UUCP () (06/27/85)
>Problem Description:
>	vi dumps core, repeatedly and reliably.
>How it started:
>Someone here had just ignored the old saying about not trying
>to fix that which is not broken, and had just rebooted all our Sun 2/170's
>(in order to force fsck to check the disks) when the problem started
>(vi core dumping).
>NOTE the following:
>1) Binary hadn't been modified at all since pulled off the Sun release tape.
>2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two
>   binaries and noted they were EXACTLY the same.
>3) Tried running the vi binary from the other Sun, noted it works FINE.
>4) Tried copying the vi binary to another file, works FINE.
>5) Get clever. Notice vi has 6 links, make 5 links to the other Sun's vi and
>   run it. Works FINE.
>6) Get cleverer still. Notice vi has sticky bit turned on, so turn on the
>   sticky bit of the other vi, try that. Works FINE.
>7) Rebooted system, problem goes away. <************* NOTE *************

We saw this happen once on a VAX 11/780 with cp, of all programs!!! We did exactly the same things that you did, with the exception of making the multiple links (cp has no links). The problem disappeared on reboot.

What we think happened (and we were never able to prove it since the problem never re-occurred), is that 4.2 (and 4.1, and 4.0) does not give up pages of pure-text programs when the program exits, but marks them as pages of pure-text programs in case the program is run again soon, so that the page does not have to be read in again from disk. The theory is that the page somehow got corrupted before its next use, but hadn't been reclaimed for some other use, so every later run picked up the corrupted in-core copy of that text page.

Unfortunately, when a program dies and dumps core, only the data portion of the program is dumped, and not the pure-text portion, so if one wanted to verify this theory, one would have to get a core dump of PHYSICAL memory before rebooting and poke around in there. We didn't think of this (and even if we had, we didn't feel like poking around in PHYSICAL memory to test this theory).

	Terry Laskodi
	of
	Tektronix
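The in-core theory above differs from the swap theory in where the stale copy lives. A minimal sketch of the idea, under stated assumptions: `ToyPageCache` and its methods are invented for illustration, the free list is modeled as a simple oldest-first queue, and nothing here reflects the real 4.xBSD data structures.

```python
from collections import OrderedDict


class ToyPageCache:
    """Toy model (hypothetical): freed text pages stay tagged until reused."""

    def __init__(self, frames):
        self.frames = frames       # how many freed pages can stay tagged
        self.free = OrderedDict()  # (program, page_no) -> contents, oldest first
        self.disk_reads = 0

    def release(self, program, page_no, contents):
        # The program exited: keep the page on the free list, still tagged.
        key = (program, page_no)
        self.free[key] = contents
        self.free.move_to_end(key)
        while len(self.free) > self.frames:
            self.free.popitem(last=False)  # oldest frame reused elsewhere

    def fetch(self, program, page_no, disk):
        key = (program, page_no)
        if key in self.free:
            # Reclaimed from the free list: no disk read, contents not checked.
            return self.free.pop(key)
        self.disk_reads += 1
        return disk[key]
```

If the contents of a tagged page are damaged while it sits on the free list, `fetch` hands back the damaged copy until the frame is reused or the machine is rebooted, which matches the "repeatable until reboot" behaviour described in the post.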
guy@sun.uucp (Guy Harris) (06/29/85)
> What we think happened (and we were never able to prove it since the problem
> never re-occurred), is that 4.2 (and 4.1, and 4.0) does not give up pages of
> pure-text programs when the program exits, but marks them as pages of
> pure-text programs in case the program is run again soon, so that the page
> does not have to be read in again from disk.

This is, indeed, the case. (Try bringing a 4.xBSD machine up single-user, timing a compile, and then timing the same compile again. The second one will be faster, both due to the pagein of the compiler passes being bypassed and due to the inodes for the passes and input files (and their directory entries, in 4.3BSD) being in an in-core cache. Of course, if you have so little physical memory that each pass of the compiler flushes the previous one out of memory, you won't get any speedup; the blocks won't be in the buffer cache because pageins don't go through the buffer cache.)

	Guy Harris
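The timing experiment described above can be sketched as a small harness that runs the same command twice and reports both wall-clock times. The function name `time_twice` is made up for this example, and on a modern system other caches will also affect the result, so the numbers only illustrate the method.

```python
import subprocess
import sys
import time


def time_twice(argv):
    """Run the same command twice; return (first, second) wall-clock seconds."""
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings[0], timings[1]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the compile command as arguments to this script.
    first, second = time_twice(sys.argv[1:])
    print("first run:  %.3f s" % first)
    print("second run: %.3f s" % second)
```

For the comparison to mean anything, the first run should happen right after boot with nothing else running, as the post suggests.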