[comp.sys.apollo] How can I get rid of Apollo IOT traps?

williams@edsews.eds.com (Dave Williams) (05/05/90)

I would like to know what an Apollo IOT trap is,
what causes it, and how to get rid of it.

We have some unix shell scripts that add/change/delete user accounts on the
Apollos. Twice now, after a test period of about 2 months, these IOT traps have
started showing up: testing goes along fine for about 2 months, then all of a
sudden the traps appear. We are only changing our own test accounts, not the
root accounts.

We have not been able to find out if IOT traps are related to:

                         registries
                         nfs
                         remsh
                         other/all?

- We are logged onto a machine (may or may not be an Apollo) as root.
- The machine we are logged into is a trusted host to the Apollo machine.
- We remsh into an Apollo to access an nfs-mounted file system and receive an
  IOT trap.

 Example:  remote_machine_prompt==> remsh apollo_node ls /nfs_mounted_filesystem

This even happens if we are logged onto the Apollo that we are trying to remsh
to.
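
A small check script along these lines can log when it happens (a sketch only -
the node and file system names are the placeholders from the example above, and
matching on the string "IOT" is an assumption about what the trap message
contains on your nodes):

    #!/bin/sh
    # Run the remsh test and record any IOT trap it produces.
    OUT=`remsh apollo_node ls /nfs_mounted_filesystem 2>&1`
    case "$OUT" in
    *IOT*)  echo "`date`: IOT trap from apollo_node" >> /tmp/iot_check.log ;;
    esac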

I am wondering if the registries cannot handle 30-50 changes per day over this
length of time. Any other ideas?

How do we get rid of this problem when it shows up? So far, the only way we have
found to get rid of the problem is to completely rebuild the registries.
-- 
Dave Williams                       |  My opinions are not necessarily those
EDS Technical Systems Development   |  of my employer.
williams@eds.com                    |
UUCP: {uunet|sun|sharkey}!edsews!williams

pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) (05/08/90)

	 
	I am wondering if the registries cannot handle 30-50 changes per day over this
	length of time. Any other ideas?

We ran into several bugs of this nature - if the registry is up for long periods
of time, it starts having problems.  The fix for us consisted of two parts.  One
was a fix to rgyd that eliminated a malloc/free problem where memory was allocated
but never freed (hence the large VM space used after large numbers of changes).
The other was a change to some inefficient hash structures that didn't work at all
well for UM.  The problem is that we have 7000 accounts, all in one organization,
and the registry was originally designed to assume maybe a few hundred accounts
per organization (%.%.org).
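
You can watch for the leak on a live node with something as simple as this
(a sketch; it assumes a BSD-style ps where the fifth column is the process
size):

    #!/bin/sh
    # Log rgyd's size once an hour; steady growth points at the leak.
    # The [r]gyd pattern keeps grep from matching its own process.
    while true; do
        echo "`date`  `ps aux | grep '[r]gyd' | awk '{print $5}'`" >> /tmp/rgyd_size.log
        sleep 3600
    done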

The first fix appears to have eliminated all IOT deaths of the registry, and the
second patch sped account updates up by a factor of several hundred.  The big speedup
comes when adding hundreds of accounts in a row - the old hashing algorithm
broke down, making each new addition increasingly expensive.  By the time something
like 50 accounts had been added to our registry, each new account was taking an
hour or more.

These two problems still exist, even in the early beta SR10.3 that we have.
	 
	How do we get rid of this problem when it shows up? So far, the only way we have
	found to get rid of the problem is to completely rebuild the registries.

This sounds a little extreme - you should not have to go to this length to get
a working registry.  You might want to check that you are not running mixed SR10.1
and SR10.2 registry daemons, since they are incompatible.  Another thing to try
is to run only one registry daemon for a while, since the propagation algorithms
sometimes have problems on a flaky network.

Paul Anderson
CAEN Systems Programmer
University of Michigan

marmen@bcara128.bnr.ca (Rob Marmen 1532773) (05/08/90)

In article <4a452733b.000f088@caen.engin.umich.edu>, pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) writes:
> These two problems still exist, even in the early beta SR10.3 that we have.
> 	 
   Thanks for the information. We also experienced the same problems. In our case,
   the problem was compounded by a very large single ethernet (i.e. not really
   flaky, but with a LOT of varied traffic on it). We experienced the malloc problem,
   as well as a problem with the registry crashing intermittently. We are also
   running the Apollo beta registries.

   Having finally gotten fed up with the whole mess, I put together a bandaid solution
   which appears to be maintaining the status quo. I have a script which is executed
   by CRON every hour. This script checks for a rgyd process. If it does not find a
   rgyd, it restarts the rgyd. If rgyd is running, it checks the size of the daemon.
   If the daemon exceeds 16M (i.e. double its normal size), the script stops the daemon
   and restarts it. A time/date stamp is placed in a file in `node_data/system_logs
   to record the event. Since doing this, our registries have been much better behaved.
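
   Something like the following gives the flavor of it (a sketch only, not the
   actual script - the daemon path, the ps columns used for the PID and size,
   and the log file name are all assumptions about the local setup):

       #!/bin/sh
       # Watchdog for rgyd, run hourly from cron.  Note the single
       # quotes: the leading ` in the Apollo `node_data path is a
       # literal character, not a shell quote.
       LOG='`node_data/system_logs/rgyd_watch.log'
       LIMIT=16384                  # 16M, assuming ps reports Kbytes

       # BSD-style ps; [r]gyd keeps grep from matching itself.  If the
       # watchdog's own file name contains "rgyd", tighten the pattern.
       PID=`ps aux | grep '[r]gyd' | awk '{print $2}'`
       SIZE=`ps aux | grep '[r]gyd' | awk '{print $5}'`

       if [ -z "$PID" ]; then
           echo "`date`: rgyd not running, restarting" >> $LOG
           /etc/rgyd &              # assumed daemon location
       elif [ "$SIZE" -gt "$LIMIT" ]; then
           echo "`date`: rgyd at ${SIZE}K, restarting" >> $LOG
           kill $PID
           sleep 30
           /etc/rgyd &
       fi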

   I recently did an audit of the logs. Since segmenting the network into 6 separate
   Apollo networks, the registries have hardly crashed at all. The new rgyds also 
   appear to be slowing down in their appetite for memory. However, I still need to
   do a more detailed study.

   If anyone is interested, I will post the script. It runs on BSD4.3 and is pretty
   simple. No support of course ;-).
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen             marmen@bnr.ca  OR             |
| Bell Northern Research    marmen%bnr.ca@cunyvm.cuny.edu |
| (613) 763-8244         My opinions are my own, not BNRs |