marmen@is2.bnr.ca (Rob Marmen 1532773) (10/17/89)
Our site has approximately 440 Apollos on ethernet running sr10.1. The Apollos are set up to run BSD4.3 unix exclusively; AEGIS is not installed. We are experiencing some difficulties with registry nodes being excessively overworked and occasionally melting down. Apollo is recommending that we increase the number of registry servers. However, they cannot recommend what the proper ratio should be, nor can they tell us whether the severe performance degradation will drop to an acceptable level.

Word about the poor registry performance has gotten out to the users, and they are refusing to have registries placed on their machines.

Has anyone experienced this registry degradation? If so, what was the solution? Finally, what ratio of registry servers to nodes are you running? Ours is approximately 1/45. If I can get data on the proper ratio, then I'll be able to convince the users to have registries placed on their nodes.

cheers, rob...
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen            marmen@bnr.ca OR                 |
| Bell Northern Research   marmen%bnr.ca@cunyvm.cuny.edu    |
| (613) 763-8244                                            |
pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) (10/20/89)
From: marmen@ucbvax.Berkeley.EDU
Message-Id: <114@bnrgate.bnr.ca>
Subject: Apollo SR10 Registry Performance

> Our site has approximately 440 Apollos on ethernet running sr10.1.
> The Apollos are setup to run BSD4.3 unix exclusively. AEGIS is not
> installed. We are experiencing some difficulties with registry nodes
> being excessively overworked and occassionally melting down.

Here at UM, we are getting towards the end of switching over 500 nodes to SR10.1, and have dealt with a number of unusual problems in the transition.

1) The SR10 print server eliminated some optimizations for printing of bitmaps. While Apollo apparently intends to restore the lost speed, the stock 10.1 prsvr is quite slow. For what it is worth, Mentor Graphics, and probably other companies as well, haven't really caught up with the new NCS-based printing scheme, so vendors' packages such as Mentor's have had some problems with printing.
2) We broke the cvtrgy program (for converting the sr10 registry to an sr9 registry) by going over about 5600 accounts, which created a full_names file that was too large and, as a result, broke one of the libraries on the 9.7 machines. This has been fixed, but the patch won't show up until a later release than the stock 10.1 tapes. If you see a failure of this type (with a large number of accounts), call Apollo tech support - they already have the fix.

3) Because the registry is now server based rather than filesystem based (a really excellent idea!), strange things can happen that don't really show up very well, except as unstable load on the servers. In particular, if you have a single registry and think that adding one additional slave will halve the load on the master, think again. If one or the other becomes unavailable, the clients will time out, then switch to the other one, and not switch back until that choice times out. So what you tend to get is the entire network load switching from node to node. The solution is to make sure that you add enough servers to truly balance the load by offering a choice of remaining servers even if one goes down. In my opinion, three servers is not enough, even for only 100-150 nodes. Four servers may be a little marginal, and more would be better. After four or five servers there is less need for additional ones, since, as I mentioned before, the load will tend to be balanced better. Someone at Apollo mentioned something like 1 server for every 150 SR10 nodes, which I can agree with, but for the first 150 nodes you probably want at least four servers.

The performance of the registry depends largely on the offered load. In our environment (lots of student labs), we see a fair amount of registry activity, due to the dynamically changing load caused by students moving between nodes at random (and doing it often - you can expect complete lab turnover in the 10 minutes between classes). So if you are in a more stable environment, the servers will see less load. If you have a large registry, however, like we do (thousands of accounts), you can expect the server to pretty much eat up a DN3000 with 4 megs. A DN4000 with 8 megs can be used interactively, but is somewhat slower than I would like. But keep in mind that what I would really like is a desktop DN10000 (hint for Randy, my boss!).

4) The registry also had a problem that we are still working on, where creation of large numbers of accounts (300 per day) bogs the registry down to the point where we can do only a few transactions per minute. Since this may be related to problem 3) above, and because we can no longer easily test it, we expect to have to wait until next term, when we get a high number of new accounts.

Good luck!

Paul Anderson
CAEN Apollo Systems Programmer
University of Michigan
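The oscillation Paul describes in point 3 can be sketched with a toy model. The server names and the sticky-failover rule here are illustrative assumptions, not Apollo's actual client code; the point is only that "fail over and don't switch back" moves the whole network's load as a block:

```python
# Toy model of registry failover: each client sticks to one server and
# only switches when that server times out. Names are made up.

def serve_load(clients, up):
    """Count how many clients each reachable server ends up handling.

    clients: list of each client's currently-preferred server (mutated)
    up:      set of servers that are currently reachable
    Clients whose preferred server is down fail over to another
    reachable server and stay there (they do not switch back).
    """
    load = {}
    for i, preferred in enumerate(clients):
        if preferred not in up:              # timeout -> fail over
            preferred = sorted(up)[0]        # pick a surviving server...
            clients[i] = preferred           # ...and stick with it
        load[preferred] = load.get(preferred, 0) + 1
    return load

# 100 clients split evenly across two servers.
clients = ["rgy_a"] * 50 + ["rgy_b"] * 50
print(serve_load(clients, {"rgy_a", "rgy_b"}))  # balanced, 50 each

# rgy_a goes down briefly: the entire network piles onto rgy_b ...
print(serve_load(clients, {"rgy_b"}))           # rgy_b carries all 100

# ... and stays there even after rgy_a comes back.
print(serve_load(clients, {"rgy_a", "rgy_b"}))  # still all on rgy_b
```

With three or more surviving servers the displaced clients at least have a choice, which is why Paul wants four or five servers even for the first 150 nodes.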
dbfunk@ICAEN.UIOWA.EDU (David B Funk) (10/21/89)
WRT posting <114@bnrgate.bnr.ca>

> Our site has approximately 440 Apollos on ethernet running sr10.1.
> The Apollos are setup to run BSD4.3 unix exclusively. AEGIS is not
> installed. We are experiencing some difficulties with registry nodes
> being excessively overworked and occassionally melting down.

I agree with all that Paul Anderson said in <46574bed8.000f088@caen.engin.umich.edu> and would add:

1) "rgyd" can be a memory hog. It likes to keep the whole database in memory and can use a lot. On our system (3000+ accounts) a "rgyd" uses about 6 Mb (based on a "ps agux"). It accounts for a lot of page faults on a system that has 8 Mb of memory (I hate to think of what it would do to a 4 Mb system :-). If possible, run it on a system that has plenty of RAM (particularly your master rgyd), and don't run it on the same node as other memory-hungry daemons. E.g., put rgyd & glbd on different nodes.

2) If you have too many rgyds, your master rgyd will spend a lot of time (CPU & network) making sure that all the slaves stay in sync. A registry change (new account, password change, shell change, etc.) must be distributed to all the slaves.

3) There is a bug in the sr10.0 & sr10.1 rgyd implementation that can cause problems when your rgyds are left up for long periods of time on an active system. We have 3000+ accounts and do "cvtrgy"s daily. We've found that we have to reboot the master rgyd node at least once a month. This is supposed to be fixed at sr10.2.

Dave Funk
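Dave's point 2 is the flip side of adding slaves: reads spread out, but every write fans out to every replica. A back-of-the-envelope model (all rates below are invented for illustration; only the shape of the curve matters):

```python
# Toy model of why "too many rgyds" hurts the master: reads spread out
# across replicas, but every write must be propagated to every slave.

def master_load(n_servers, read_ops, write_ops):
    """Approximate ops/sec of work landing on the master server.

    The master serves its even share of reads, plus one propagation
    per slave for every registry write (new account, password change,
    shell change, ...).
    """
    return read_ops / n_servers + write_ops * (n_servers - 1)

# Network-wide: 40 read ops/sec, 2 write ops/sec (made-up numbers).
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} servers -> master handles "
          f"{master_load(n, 40.0, 2.0):.1f} ops/sec")
```

The curve is U-shaped: a few slaves relieve the master, but past roughly four or five the write-propagation cost starts to dominate, which matches both Dave's warning and Paul's "four or five servers, then stop" advice.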
marmen@is2.bnr.ca (Rob Marmen 1532773) (10/24/89)
Thanks for the reply. There are a couple more pieces of information that are relevant.

1) We are running BETA sr10.2 registries. This was needed to fix a problem with a server responding 20 times to each registry request. The new registries fix that problem. This was quite serious because we are 100% ethernet and the network load was phenomenal. Also, HP/Apollo is working on an ethernet microcode fix to cure a problem with short interpacket spacing.

2) Our site is 100% BSD unix. This puts an additional load on the servers because the /etc files are now streams to the server. It's amazing how many programs look at these files constantly. The applications have not been modified to take into account that there are now network accesses instead of disk accesses.

Under sr9.7, we had a ratio of 1 registry for 35 nodes. In addition, the workstations were modified so that the /etc directory was resident on each node. Otherwise the performance was unacceptable.

At this point, we are going to experiment with a ratio of 1 server for 20 workstations. It's just a guess, but we reason that since the unix programs are going to constantly look at the /etc files, the number of servers should actually be greater than at an sr9 site.

Any comments? cheers, rob...

PS. Again, thanks for the reply.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen            marmen@bnr.ca OR                 |
| Bell Northern Research   marmen%bnr.ca@cunyvm.cuny.edu    |
| (613) 763-8244      My opinions are my own, not BNRs      |
vince@bcsaic.UUCP (Vince Skahan) (10/24/89)
One registry site per 45 nodes sounds a bit (a lot) short to me. I'm running a rgyd on a dn3000 (8MB) that is the gateway to my ring, and I see the effects now and then.

I seem to remember the docs saying something like "use as many as you need". I'd guess that something like one rgyd per 25 nodes or so would be a bit more reasonable (though I'd like to hear what rule-of-thumb Apollo has too...especially regarding what system to use and how much memory).

Based on what I've seen, the amount of memory and the number of other processes you have makes a big difference (for example, if you run "at" or "cron", you'll see lots more effect than if you have only a rgyd).
--
Vince Skahan                  Boeing Computer Services - Phila.
(215) 591-4116                ARPA: vince@atc.boeing.com
                              UUCP: bcsaic!vince
Note: the opinions expressed above are mine and have no connection to Boeing...
pato@apollo.HP.COM (Joe Pato) (10/25/89)
In article <127@bnrgate.bnr.ca>, marmen@is2.bnr.ca (Rob Marmen 1532773) writes:

> 1) We are running BETA sr10.2 regitries. This was needed to fix
>    a problem with a server responding 20 times to each registry
>    request. The new registries fix that problem.

The multiple-replies-to-a-request bug existed in an ALPHA version of the sr10.2 registry server. You were running this to circumvent other problems in your network.

> 2) Our site is 100% BSD unix. This puts an additional load on the
>    servers because the /etc files are now streams to the server.
>    It's amazing how many programs look at these files constantly.

Most Apollo-supplied application programs do not scan the /etc/passwd file; they use the standard UNIX passwd file data accessor functions getpwnam() and getpwuid(). These operations are implemented as remote calls to the registry server and avoid having to transmit the entire passwd file. Third-party applications should also be modified to use these functions rather than directly reading the file.

If you have several applications that are scanning the password file frequently, you might want to create local copies of the passwd file. (To do so, you would move the /etc/passwd object to /etc/passwd.real and copy the /etc/passwd.real file to /etc/passwd periodically from cron.)

At sr10.2 we have modified the /etc/passwd object to cache the passwd file locally. When the object is opened, the type manager contacts the registry server to determine whether the cached copy is current, and if not it caches a new copy. This has two benefits over sr10 and sr10.1: 1) the data in the passwd file is now current instead of potentially being 2 hours out of date, and 2) the data is often retrieved from the local disk rather than across the network. The latter is especially true if updates to the passwd file are infrequent relative to the number of times the file is scanned.

> At this point, we are going to experiment with a ratio of 1 server
> for 20 workstations.

I firmly believe that 1 registry server for 20 workstations is tremendous overkill. In the Apollo corporate internetwork we run 1 server for every 100 workstations. (And we only run so many servers because our internetwork stretches over 2 states and we want a server in every "important" network for reliability. We could get by on many fewer servers.)

Statistics gathered over an average 2-business-day period indicate that our corporate internetwork handles about 25 registry operations per second. Peak access to the registry servers is between 35-40 operations per second. Stress tests indicate that a server running on an unloaded 8 Mbyte DN2500 should be able to handle about 64 operations per second before clients perceive any real degradation in service. (This would indicate that a single dn2500 should be able to service our entire 3500+ node internetwork. Again, we would not want to do this, for reliability reasons and because of the additional traffic from remote networks.)

Paul Anderson's observations about the U. Mich. environment seem to be appropriate here. He observes that 1 server for every 150 machines seems adequate, but that you really need 4 servers for the first 150 machines for the load-balancing strategy to work well.

Joe Pato
Apollo Computer               NSFNET: pato@apollo.com
A Subsidiary of               UUCP: ...{attunix,uw-beaver,brunix}!apollo!pato
Hewlett-Packard
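Joe's advice - use the passwd accessor functions instead of scanning /etc/passwd - looks like this from a script's point of view. This sketch uses Python's pwd module, which wraps the same getpwnam()/getpwuid() C calls; on a system where those calls are backed by a registry or directory service, each one is a single lookup rather than a transfer of the whole file:

```python
import pwd

# Instead of reading and parsing all of /etc/passwd ...
#     for line in open("/etc/passwd"): ...
# ... ask for exactly the entry you need.

entry = pwd.getpwnam("root")       # lookup by user name
print(entry.pw_uid, entry.pw_dir)  # numeric uid and home directory

same = pwd.getpwuid(entry.pw_uid)  # lookup by numeric uid
assert same.pw_name == entry.pw_name
```

The design point is the same one Joe makes: the accessor interface lets the implementation decide whether the answer comes from a flat file, a local cache, or a remote server, without the application changing.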
pato@apollo.HP.COM (Joe Pato) (10/27/89)
In article <16155@bcsaic.UUCP>, vince@bcsaic.UUCP (Vince Skahan) writes:

> I seem to remember the docs saying something like "use as many as you
> need". I'd guess that something like one rgyd per 25 nodes or so would
> be a bit more reasonable (though I'd like to hear what rule-of-thumb
> Apollo has too...especially regarding what system to use and how much
> memory).

As I've stated before, we expect that you can run about 100 machines per registry server. We do, however, expect that those registry servers are running on server nodes - not nodes in personal use. A slow machine (e.g., a dn3000) with a reasonable amount of memory (>= 8 Mbytes) should do just fine.

Joe Pato
Apollo Computer               NSFNET: pato@apollo.com
A Subsidiary of               UUCP: ...{attunix,uw-beaver,brunix}!apollo!pato
Hewlett-Packard
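The figures Joe quoted earlier in the thread (roughly 3500 nodes, 40 ops/sec at peak, about 64 ops/sec per server) make it easy to see why throughput alone never justifies a 1-per-20 ratio. A quick sanity check; the 2x headroom factor is my own assumption, not Apollo's:

```python
# Back-of-the-envelope registry sizing from the figures in the thread.
import math

nodes = 3500
peak_ops = 40.0          # peak network-wide registry ops/sec
per_server_cap = 64.0    # ops/sec one unloaded DN2500 can sustain
headroom = 2.0           # assumption: keep each server under 50% busy

ops_per_node = peak_ops / nodes
servers_needed = math.ceil(peak_ops * headroom / per_server_cap)

print(f"{ops_per_node:.3f} ops/sec generated per node")
print(f"{servers_needed} server(s) cover the whole network, with headroom")

# Load per server at the ratios discussed in the thread:
for ratio in (20, 45, 100, 150):
    print(f"1 server per {ratio:3d} nodes -> "
          f"~{ratio * ops_per_node:.1f} ops/sec each")
```

Even at 1 server per 150 nodes, each server sees under 2 ops/sec against a 64 ops/sec capacity, so the real constraints are the ones the thread identifies: failover choice, reliability across networks, and memory, not raw throughput.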