burati@ULOWELL.CSNET (Michael Burati) (02/10/86)
We have been having a small problem lately with files periodically becoming corrupt for no visible reason. The first couple of times it happened, I assumed it might be due to a couple of network crashes (a coax problem) that coincidentally happened while the files were being transferred.

This is the problem: a file on a DFS500 could not be edited from another node. Every time we tried to access the file with the AEGIS editor, it hung for about a minute and then reported "Remote node failed to respond to request". The file still showed up in the directory listing, but sometimes its attributes came back as unavailable (using $ ld -a), even though all the other files in the directory were accessible. Using "catf", the file would print about halfway through, hang for a minute, and then return "Remote node failed ...".

Using the "vi" editor, I finally managed to read in the file, advance to the end of the file, and write the file back out. This appeared to solve the problem, but when we tried to compile the file, we realized that about 50 lines were now missing from the middle of it.

What's happening?? Any insight into this problem, with possible solutions, would be appreciated. Note that this has happened several times over the last few months, on different nodes. (Also note that the network hasn't hung since the first occurrence I mentioned.)

Thanks in advance.

Mike Burati
University of Lowell
Comp Sci Dept
Lowell MA
UUCP: ..wanginst!ulowell!burati
CSNET: burati@ulowell
Mark_Giuffrida@UMICH-MTS.MAILNET (02/11/86)
We have noticed the same problem on occasion, on just one of our machines. When it happens, no matter where you are in the network, you always get the "remote node failed to respond..." msg. As I said, it happens to just *one* of our machines and no others, and I haven't seen the problem in a month or so. I can only recommend two things that helped us:

1) Use DMPF on the corrupted file and see whether the header (the first 32 bytes) has been corrupted, i.e., is all zeros. If so, the file should be "recreated".

2) Use the NETSTAT command during the times when the files aren't accessible, and check whether the "Last ring hardware failure..." message shows the current time. We had a temperature-sensitive board on one of our DN660's (when the temperature got above 75, it would act up). It caused subtle network problems and sometimes heavy ones. We traced one instance of the "remote node failed..." msg to this board while the network was fine for all other nodes.

Mark Giuffrida
CAEN - University of Michigan
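The zero-header check in suggestion 1 can be expressed as a small script. This is only a minimal sketch of the same test DMPF lets you do by eye, not a replacement for DMPF: the 32-byte size and the "all zeros" criterion come from the post above, while the function names are hypothetical. It also assumes you can get at the raw bytes of the object including its header (for example from a raw dump or a copy that preserves the header); an ordinary stream read of a text file on DOMAIN may not return the header at all.

```python
import sys

HEADER_SIZE = 32  # header length quoted in the post above


def header_looks_corrupt(data: bytes, header_size: int = HEADER_SIZE) -> bool:
    """Return True if the header is missing (file too short) or all zero bytes."""
    header = data[:header_size]
    # A file shorter than the header cannot have an intact header,
    # so flag it as suspect as well.
    if len(header) < header_size:
        return True
    # any() over bytes iterates integer values; all zeros means no nonzero byte.
    return not any(header)


def check_file(path: str, header_size: int = HEADER_SIZE) -> bool:
    """Read just the first header_size bytes of path and test them."""
    with open(path, "rb") as f:
        return header_looks_corrupt(f.read(header_size), header_size)


if __name__ == "__main__":
    # Hypothetical usage: python check_header.py file1 file2 ...
    for path in sys.argv[1:]:
        status = "SUSPECT (zeroed or short header)" if check_file(path) else "ok"
        print(f"{path}: {status}")
```

Treating a too-short file as suspect is a design choice of this sketch; a file that fails the check is a candidate for recreating, as Mark suggests, rather than proof of corruption.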
Giebelhaus@HI-MULTICS.ARPA (02/11/86)
It sounds like you are still having network problems. Have you run probenet or checked netmain? You probably lost the 50 lines when vi tried to write the file (or a page of the file) back.