williams@edsews.eds.com (Dave Williams) (05/05/90)
I would like to know what an Apollo IOT trap is, what causes it, and how
to get rid of it.

We have some unix shell scripts that add/change/delete user accounts on
the Apollos.  Twice, after test periods of about 2 months, these IOT
traps have started showing up.  For about 2 months, testing goes along
fine; then all of a sudden, IOT traps show up.  We are only changing our
own test accounts, not the root accounts.

We have not been able to find out whether IOT traps are related to:

	registries
	nfs
	remsh
	other/all?

- We are logged onto a machine (may or may not be an Apollo) as root.
- The machine we are logged into is a trusted host to the Apollo machine.
- We remsh into an Apollo to access an nfs-mounted file system and
  receive an IOT trap.

Example:

	remote_machine_prompt==> remsh apollo_node ls /nfs_mounted_filesystem

This even happens if we are logged onto the Apollo that we are trying to
remsh to.

I am wondering if the registries cannot handle 30-50 changes per day over
this length of time.  Any other ideas?  How do we get rid of this problem
when it shows up?  So far, the only way we have found to get rid of the
problem is to completely rebuild the registries.

--
Dave Williams                      | My opinions are not necessarily those
EDS Technical Systems Development  | of my employer.
williams@eds.com                   | UUCP: {uunet|sun|sharkey}!edsews!williams
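A minimal sketch of the kind of loop that could be used to see which
Apollo nodes exhibit the failure - the node names and the mount point
are placeholders, not taken from the setup described above:

	#!/bin/sh
	# Try the same remsh + NFS listing against several Apollo nodes and
	# note which ones fail.  Node names and the mount point are examples.
	NODES="apollo_node1 apollo_node2 apollo_node3"
	FS=/nfs_mounted_filesystem

	for node in $NODES
	do
		if remsh $node ls $FS > /dev/null 2>&1
		then
			echo "$node: ok"
		else
			echo "$node: remsh ls of $FS failed (possible IOT trap)"
		fi
	done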
pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) (05/08/90)
> I am wondering if the registries cannot handle 30-50 changes per day
> over this length of time.  Any other ideas?

We ran into several bugs of this nature - if the registry is up for long
periods of time, it starts having problems.  The fix for us consisted of
two parts.  The first was a fix to rgyd that eliminated a malloc/free
problem where memory was allocated but never freed (hence the large VM
space used after large numbers of changes).  The second was a change to
some inefficient hash structures that did not work at all well for UM.
The problem is that we have 7000 accounts, but all in one organization,
and the registry was originally designed on the assumption of maybe a
few hundred accounts per organization (%.%.org).

The first fix appears to have eliminated all IOT deaths of the registry,
and the second patch sped account updates up by a factor of several
hundred.  The big speedup comes when adding hundreds of accounts in a
row - the old hash structure algorithm broke down, making each new
addition increasingly expensive.  By the time something like 50 accounts
had been added to our registry, each new account was taking an hour or
more.

These two problems still exist, even in the early beta SR10.3 that we
have.

> How do we get rid of this problem when it shows up?  So far, the only
> way we have found to get rid of the problem is to completely rebuild
> the registries.

This sounds a little extreme - you should not have to go to this length
to get a working registry.  You might want to check that you are not
running mixed SR10.1 and SR10.2 registry daemons, since they are
incompatible.  Another thing to try is to run only one registry daemon
for a while, since the propagation algorithms sometimes have problems on
a flaky network.

Paul Anderson
CAEN Systems Programmer
University of Michigan
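One way to confirm the rgyd memory leak described above is simply to log
the daemon's size periodically and watch it climb as accounts are
changed.  A rough sketch for a BSD-style ps - the column used for the
size (field 5 of "ps aux" here, the SZ column on 4.3BSD) is an
assumption and varies between systems:

	#!/bin/sh
	# Append a timestamped record of rgyd's size to a log file; run it
	# from cron every hour or so.  The size is assumed to be field 5 of
	# "ps aux" output - adjust for your ps.
	LOG=/tmp/rgyd_size.log

	pid=`ps aux | grep '[r]gyd' | awk '{print $2}'`
	size=`ps aux | grep '[r]gyd' | awk '{print $5}'`

	if [ -n "$pid" ]
	then
		echo "`date`  pid=$pid  size=${size}K" >> $LOG
	else
		echo "`date`  rgyd not running" >> $LOG
	fi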
marmen@bcara128.bnr.ca (Rob Marmen 1532773) (05/08/90)
In article <4a452733b.000f088@caen.engin.umich.edu>, pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) writes:
>
> These two problems still exist, even in the early beta SR10.3 that we have.
>

Thanks for the information.  We also experienced the same problems.  In
our case, the problem was compounded by a very large single ethernet
(i.e. not really flaky, but with a LOT of varied traffic on it).  We
experienced the malloc problem, as well as a problem with the registry
crashing intermittently.  We are also running Apollo beta registries.

Having finally gotten fed up with the whole mess, I put together a
bandaid solution which appears to be maintaining the status quo.  I have
a script which is executed by cron every hour.  This script checks for a
rgyd process.  If it does not find a rgyd, it restarts the rgyd.  If rgyd
is running, it checks the size of the daemon.  If the daemon exceeds 16M
(i.e. double its normal size), the script stops the daemon and restarts
it.  A time/date stamp is placed in a file in `node_data/system_logs to
record the event.

Since doing this, our registries have been much better behaved.  I
recently did an audit of the logs.  Since segmenting the network into 6
separate Apollo networks, the registries have hardly crashed at all.  The
new rgyds also appear to have slowed down in their appetite for memory.
However, I still need to do a more detailed study.

If anyone is interested, I will post the script.  It runs on BSD4.3 and
is pretty simple.  No support of course ;-).

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen            marmen@bnr.ca OR                |
| Bell Northern Research   marmen%bnr.ca@cunyvm.cuny.edu   |
| (613) 763-8244   My opinions are my own, not BNRs        |
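Rob's actual script was not posted here; what follows is only a minimal
sketch of a watchdog along the lines he describes.  The rgyd path, the
log file location, the ps size column, and the 16M threshold are all
assumptions, and the restart command in particular may need to be
whatever your site normally uses to start rgyd:

	#!/bin/sh
	# Hourly cron job: restart rgyd if it has died, or if its size has
	# grown past a threshold (the malloc leak).  The rgyd path, log file
	# location, size column, and threshold are all assumptions - this is
	# not the script described above.
	RGYD=/etc/rgyd                             # assumed location
	LOG=/sys/node_data/system_logs/rgyd_watch  # assumed `node_data spelling
	LIMIT=16384                                # kilobytes, i.e. 16M

	pid=`ps aux | grep '[r]gyd' | awk '{print $2}'`
	size=`ps aux | grep '[r]gyd' | awk '{print $5}'`

	if [ -z "$pid" ]
	then
		echo "`date`  rgyd not running - restarting" >> $LOG
		$RGYD &
	elif [ "$size" -gt "$LIMIT" ]
	then
		echo "`date`  rgyd pid $pid at ${size}K (> ${LIMIT}K) - restarting" >> $LOG
		kill $pid
		sleep 10
		$RGYD &
	fi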