watcher@athena.mit.edu (chris ross) (01/10/90)
The Codex network management software I work with has been experiencing some bizarre problems. We have a large directory, linked to by several nodes, which contains links pointing back at the /tmp directories on those nodes: say, /top exists on each node, /top/foo is a directory on //node_01 and a link to //node_01/top/foo on //node_02 thru //node_10. The links in //node_01/top/foo look something like this:

    file_a    "//node_01/sys/node_data/tmp/file_a"
    file_b    "//node_02/sys/node_data/tmp/file_b"
    file_c    "//node_02/sys/node_data/tmp/file_c"
    file_d    "//node_03/sys/node_data/tmp/file_d"

and so on. The routines which frob with these links are part of a library inlib'ed by many different applications, so several processes may be simultaneously attempting to modify the directory (which should NOT cause problems).

What happens is this: at some point while the system is running (on several nodes), the link directory goes south. The text of nearly every link (although initially created with a valid pathname) becomes garbled with random characters. Running /com/sald deletes about half the links and usually returns the remainder to legal, though oddly truncated, pathnames that do not exist:

    file_a    "ode_data/tmp/file_a"
    file_d    "ode_data/tmp/file_d"

and so on.

We have a similar but less frequent problem that affects a common directory of log files, also linked to by each node in the net mgmt setup. Again, at some random time, a directory of previously valid text files becomes corrupted such that

    (a) several files disappear completely
    (b) /com/ld can list each remaining file, but listing a file
        explicitly with "ld filename" yields "name not found"
    (c) ld -a shows "attributes unavailable" on some files
    (d) running /com/sald has no apparent effect.

Our software is running at SR9.7.5, on a random mix of DN30xx, DN35xx, and DN45xx nodes. We are getting no hard disk errors or strange status messages from the DM.
Has anyone seen anything like this before, and found a way around it? Offhand I'd say we have a name_$ call somewhere with invalid parameters which are not being caught by the O/S, but such a glitch shouldn't corrupt most of a directory.

Any help would be *greatly* appreciated. thanx.

________________________________________________________________
chris ross <0> uunet!codex!watcher or watcher@athena.mit.edu