[mod.computers.apollo] File corruption.

burati@ULOWELL.CSNET (Michael Burati) (02/10/86)

We have been having a small problem lately with files becoming corrupt
periodically for no visible reason.  The first couple of times it happened,
I assumed that it might be due to a couple of network crashes (coax problem)
that coincidentally happened while the files were being transferred.

	This is the problem:
A file on a DFS500 could not be edited from another node.  After about a minute
of hanging, every time we tried to access the file with the AEGIS editor,
the error "Remote node failed to respond to request" appeared.  The file still
showed up in the directory listing, but sometimes it would return with
attributes unavailable (using $ ld -a) (note: all the other files in the
directory were accessible).  Using "catf", the file would print about halfway
through, then hang for a minute, and return with "Remote node failed ...".
Using the "vi" editor, I finally managed to read in the file, advance to
the end of file, then write the file back out. This appeared to solve the
problem, but when we tried to compile the file, we realized that about 50 lines
were then missing from the middle of the file.  What's happening?


Any insight into this problem, along with possible solutions, would be appreciated.
Note that this happened several times, over the last few months, on
different nodes. (Also note that the network hasn't hung since the first
occurrence that I mentioned.)

Thanks in advance.

Mike Burati
University of Lowell
Comp Sci Dept
Lowell MA

UUCP: ..wanginst!ulowell!burati
CSNET: burati@ulowell

Mark_Giuffrida@UMICH-MTS.MAILNET (02/11/86)

We have noticed that same problem on occasion on just one of
our machines.   When it happens, no matter where you are in the
network, you always get the "remote node failed to respond..." msg.
As I said, it happens to just *one* of our machines, and no others.
I haven't seen the problem in a month or so.  I can only recommend
2 things which helped us:

1) Use DMPF on the corrupted file and see if the header (the first
   32 bytes) has been corrupted -- i.e., is all zeros.  If so, then the
   file should be "recreated".
2) Use the NETSTAT command during these times when the files aren't
   accessible and check to see if the "Last ring hardware failure..."
   message has the current time.  We had a temperature-sensitive
   board (when the temp got above 75, it would act up) on one of our
   DN660s.  It caused subtle network problems and sometimes heavy
   network problems.  We traced one instance of the "remote node failed..."
   msg to this when the network was fine for all other nodes.
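
The header check in (1) could be sketched roughly as below, for anyone
who wants to run it over a whole directory rather than eyeballing DMPF
output.  This is only an illustration in Python, not an Apollo tool: the
function name header_is_zeroed is made up, and it assumes the 32 header
bytes are visible to an ordinary byte-level read of the file, which may
not hold on AEGIS (DMPF remains the tool to trust there).

```python
HEADER_SIZE = 32  # header length quoted in item (1) above


def header_is_zeroed(path, header_size=HEADER_SIZE):
    """Return True if the first header_size bytes of the file are all zero.

    Hypothetical helper for illustration; assumes the header is readable
    as plain bytes at the start of the file.
    """
    with open(path, "rb") as f:
        header = f.read(header_size)
    # A file shorter than the header cannot be judged; report it as not zeroed.
    if len(header) < header_size:
        return False
    return header == bytes(header_size)
```

A file for which this returns True would be a candidate for being
"recreated" as described in (1).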
                                 Mark Giuffrida
                                 CAEN - University of Michigan

Giebelhaus@HI-MULTICS.ARPA (02/11/86)

It sounds like you are still having network problems.  Have you done a
probenet or checked netmain?  You probably lost the 50 lines when vi
tried to put the file (or page of the file) back.