haynes%ucsce.ucscc.UUCP%Berkeley@ucscc.UUCP (02/05/85)
Under 4.2BSD we sometimes get a correctable memory error reported on the console, and it seems to persist until the system is rebooted. Presumably it is in an area of memory that is only read and never written, such as the kernel or locore. This suggests that:

(1) The memory controller hardware/firmware doesn't write the corrected word back into memory when it corrects an error, but leaves the bad word in place.
(2) An appropriate thing for the software to do on a correctable memory error would be to read the contents of the location and write them back, thus making the correction 'permanent'.

Can anybody confirm that (1) is true and that (2) is appropriate? Has anybody patched the kernel code to do this? Is this problem peculiar to 750s?

ucbvax!ucscc!haynes
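
To make the read-and-write-back idea in (2) concrete, here is a minimal sketch in C. It is a user-space simulation, not kernel code and not anybody's actual patch. It assumes (1) holds, i.e. that a read returns corrected data but leaves the bad word in the array. The names ecc_word, ecc_read, ecc_write and scrub_word are hypothetical, invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of one ECC-protected memory word.
 * 'stored' is what physically sits in the array; 'flip' is the mask of
 * the single bad bit that the ECC logic corrects on the way out
 * (0 means the word is clean).
 */
struct ecc_word {
    uint32_t stored;
    uint32_t flip;
};

/*
 * Simulated read under assumption (1): the controller hands back the
 * corrected value and flags the error, but does NOT repair the array.
 */
static uint32_t ecc_read(const struct ecc_word *w, int *corrected)
{
    *corrected = (w->flip != 0);
    return w->stored ^ w->flip;
}

/* Simulated write: stores fresh data (and, in hardware, fresh check bits). */
static void ecc_write(struct ecc_word *w, uint32_t value)
{
    w->stored = value;
    w->flip = 0;
}

/*
 * The proposed software fix: read the location and, if the read was
 * corrected, write the good value back. Returns 1 if a correction was
 * made permanent, 0 if the word was already clean.
 */
static int scrub_word(struct ecc_word *w)
{
    int corrected;
    uint32_t good;

    assert(w != NULL);
    good = ecc_read(w, &corrected);
    if (corrected)
        ecc_write(w, good);
    return corrected;
}
```

In this model the error is reported on every read until scrub_word runs once, which matches the "persists until reboot" symptom for read-only regions. A real kernel version would have to do the read and write on the physical address reported by the controller, with nothing else able to modify the word in between. It would only cure a transient (soft) error, since a genuinely failed cell would report again after the write-back.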
charisse@infoswx.UUCP (02/22/85)
If you are using DEC memory, we have experienced the same thing. We get soft ECC errors on a particular board and they just go on and on. I've never tried just rebooting; our DEC rep said it was probably a heat problem, so we shut down the machine, yank each of the boards out, and stuff them in again. We have run for as long as 10 days before bothering to shut the system down and have never had a hard error. I'd be interested in any information about this problem.

Thanx,
charisse castagnoli
{ihnp4, allegra!convex, ctvax}!infoswx!charisse