[comp.sys.apollo] corrupted link entries in sr9.7 directory

watcher@athena.mit.edu (chris ross) (01/10/90)

The Codex network management software I work with has been experiencing
some bizarre problems.  We have a large directory, linked to from
several nodes, which contains links pointing back at the /tmp
directories on those nodes.  Say /top exists on each node: /top/foo is
a real directory on //node_01, and on //node_02 through //node_10 it is
a link to //node_01/top/foo.  The links in //node_01/top/foo look
something like this:

	file_a	"//node_01/sys/node_data/tmp/file_a"
	file_b	"//node_02/sys/node_data/tmp/file_b"
	file_c	"//node_02/sys/node_data/tmp/file_c"
	file_d	"//node_03/sys/node_data/tmp/file_d"

and so on.  The routines which frob these links are part of a
library inlib'ed by many different applications, so several processes
may be simultaneously attempting to modify the directory (which should
NOT cause problems).
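
To rule out concurrent updates as the trigger, one experiment would be
to serialize every link update behind a lock file.  The following is
only a sketch of that idea in Python, not our code: POSIX symlinks
stand in for Apollo links, and update_link() and the ".lock" file name
are hypothetical names made up for illustration.

```python
import os
import time


def update_link(link_dir, name, target, timeout=5.0):
    """Point link_dir/name at target while holding an exclusive lock file.

    Hypothetical helper: the lock file lives beside the directory so it
    never shows up as an entry inside it.
    """
    lock_path = os.path.normpath(link_dir) + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists, so
            # only one process at a time gets past this point.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not lock " + link_dir)
            time.sleep(0.05)
    try:
        link_path = os.path.join(link_dir, name)
        # lexists() is true even for a link whose target does not exist.
        if os.path.lexists(link_path):
            os.remove(link_path)
        os.symlink(target, link_path)
    finally:
        os.close(fd)
        os.remove(lock_path)
```

If the corruption stopped with all writers going through something
like this, that would point at a race in the directory updates rather
than at a bad parameter.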

What happens is this: at some point when the system is running (on
several nodes), the link directory goes south.  The text of nearly
every link (although initially created with a valid pathname) becomes
garbled with random characters.  Running /com/sald deletes about half
the links and usually restores the remainder to legal, though oddly
truncated, pathnames that point at nothing:

	file_a	"ode_data/tmp/file_a"
	file_d	"ode_data/tmp/file_d"

and so on.
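
Comparing the two salvaged links with their original text suggests the
truncation is regular.  The sketch below (Python, purely illustrative;
lost_prefix() is a made-up helper, and the strings are the example
listings from this post, which are only approximations of the real
directory) works out what was cut off the front of each link.

```python
# Link text as created, and as left behind by /com/sald, copied from
# the example listings in this post.
before = {
    "file_a": "//node_01/sys/node_data/tmp/file_a",
    "file_d": "//node_03/sys/node_data/tmp/file_d",
}
after = {
    "file_a": "ode_data/tmp/file_a",
    "file_d": "ode_data/tmp/file_d",
}


def lost_prefix(original, salvaged):
    """Return the leading text of original that is missing from salvaged,
    or None if salvaged is not a tail of original."""
    if not original.endswith(salvaged):
        return None
    return original[: len(original) - len(salvaged)]


for name in sorted(after):
    prefix = lost_prefix(before[name], after[name])
    print(name, len(prefix), repr(prefix))
# file_a 15 '//node_01/sys/n'
# file_d 15 '//node_03/sys/n'
```

If the examples are representative, each surviving link has lost
exactly its first 15 characters, which looks more like a fixed offset
error in the entry than random damage.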

A similar but less frequent problem affects a common directory of log
files, also linked to from each node in the net mgmt setup.  Again, at
some random time, this directory of previously valid text files becomes
corrupted such that:

	(a) several files disappear completely
	(b) /com/ld can list each remaining file, but listing a file
		explicitly with "ld filename" yields "name not found"
	(c) ld -a shows "attributes unavailable" on some files
	(d) running /com/sald has no apparent effect.
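
Symptom (b) amounts to a name that shows up when the directory is
enumerated but cannot be looked up directly.  A periodic check for
that condition might catch the corruption closer to when it happens.
The sketch below is Python using generic POSIX-style calls, not the
Apollo naming calls, and check_directory() is a hypothetical name.

```python
import os


def check_directory(path):
    """Return (name, error text) pairs for entries that appear in the
    directory listing but cannot be looked up by name."""
    problems = []
    for name in sorted(os.listdir(path)):
        try:
            # lstat() looks the entry up by name without following
            # links, so a link to a missing target is not reported;
            # only a failed lookup of the entry itself is.
            os.lstat(os.path.join(path, name))
        except OSError as exc:
            problems.append((name, exc.strerror))
    return problems
```

On a healthy directory this returns an empty list; any entry it
reports is one that lists but fails an explicit lookup.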

Our software is running at SR9.7.5, on a random mix of DN30xx, DN35xx,
and DN45xx nodes.  We are getting no hard disk errors or strange
status messages from the DM.

Has anyone seen anything like this before, and found a way around it?
Offhand I'd say we have a name_$ call somewhere with invalid parameters
which are not being caught by the O/S, but such a glitch shouldn't
corrupt most of a directory.

Any help would be *greatly* appreciated.
thanx.

________________________________________________________________
chris ross  <0>  uunet!codex!watcher  or  watcher@athena.mit.edu