dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) (08/14/90)
A few days ago I posted a question to the BIND list. I got a few responses, but not as many as I expected. This mailing list is a little more active than the BIND list, so maybe I can get a few more ideas from it. /dan

To: bind@ucbarpa.Berkeley.EDU
Subject: How do YOU tell if named has died?

Two events yesterday prompt me to ask the assembled multitude the question in the subject line.

First, I discovered that while I was away on vacation named on one of our hosts died and nobody here at the time noticed its passing. It didn't cause problems for anyone because it is only doing DNS for our local domain and is one of two secondaries. The primary and the off-site secondary were both doing their job. But we might not have been so lucky.

Second, I read J. Van Bokkelen's new RFC (RFC 1173) and was reminded of my responsibilities (which I obviously had been shirking).

So how do folks arrange to get automatic notification in a timely way when their nameserver software dies? Answers for diverse hardware running unix for me, but others may be interested in other cases.

/dan
mlindsey@x102c.harris-atd.com (Lindsey MS 04396) (08/15/90)
In article <9008141525.AA27754@sci.ccny.cuny.edu> dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) writes:
>So how do folks arrange to get automatic notification in a timely way
>when their nameserver software dies? Answers for diverse hardware
>running unix for me, but others may be interested in other cases.

We have a shell script that cron runs every minute to check named, log errors, and restart named if applicable. It's not very sexy, but it works ok. Since the shell script is so short, I've included it below:

################### cut here ####################################
#!/bin/sh
# Program:   check_named
# Purpose:   to ensure named is always running.
# Code date: 3/2/90
#
# Revision History:
#
# Date		Name		Description of change
#

# Throw away any core file a crashed named left behind.
if [ -f /etc/named/core ]
then
	rm /etc/named/core
fi

# If a pid file exists, check whether that pid is still an in.named
# process; if not (or if there is no pid file), restart named and
# log the time of the restart.
if [ -f /etc/named.pid ]
then
	PID=`cat /etc/named.pid`
	ANSWER=`ps $PID | grep -c in.named`
	if [ "$ANSWER" = 0 ]
	then
		/usr/etc/in.named
		date >> /etc/named/named_downtime
	fi
else
	/usr/etc/in.named
	date >> /etc/named/named_downtime
fi
exit 0
################### cut here ####################################

"Waste your brain, wax your board, and pray for waves!"
	Woody in E.G.A.E.

/earth is 98% full! Please delete anyone you can! (anonymous)

$teve Lindsey |-) uunet!x102a!mlindsey
(407) 727-5893 :-) mlindsey@x102a.ess.harris.com
david@twg.com (David S. Herron) (08/17/90)
In article <9008141525.AA27754@sci.ccny.cuny.edu> dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) writes:
>Subject: How do YOU tell if named has died?
>So how do folks arrange to get automatic notification in a timely way
>when their nameserver software dies? Answers for diverse hardware
>running unix for me, but others may be interested in other cases.

A quick hack would be to have a cron job which occasionally either checks for the existence of critical processes and starts 'em up, or just starts 'em and lets the processes fight over how many of which kind are to be running. Buuuut..

There's a generic problem with the way daemons are done in Unix whose scope is beyond `name service': daemons are processes spun off into the background and not watched after. [So therefore I'm cross-posting to unix-wizards..] If `cron' dies the system is just as crippled, though in a different way. And random people are just as likely to notice cron dying as they do when named dies now.

Something on my long and varied list of Things To Do (but haven't done yet) is to write a program (name: respawn, or daemond) which watches after generic processes, as opposed to init, which is suited to watching after /etc/getty's.

This process will somehow take a list of processes to watch after. It will be the parent of all those processes, so that it will be notified when they die. It will have a number of actions it can take when a process dies: wait awhile before starting a new copy, start one immediately, start one under some condition, etc.

This is different from init in that init is rather specific to watching after getty's. Even the SysV version of init .. though the configurability of /etc/inittab gets close to what I have in mind. This is different from inetd in that inetd is specific to network services. `cron' is not a network service, yet it also needs to be watched over in this way.
Also, inetd is suited to a situation where it starts up a fresh process for each connection -- in the particular case of named this is bad because named needs to be running all the time.

At the moment we're relying on the hopeful assumption of a lack of bugs in these background daemons. (Where's some wood to knock on??)
--
<- David Herron, an MMDF weenie, <david@twg.com>
<- Formerly: David Herron -- NonResident E-Mail Hack <david@ms.uky.edu>
<-
<- Sign me up for one "I survived Jaka's Story" T-shirt!
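[Editor's note: the respawn/daemond idea above can be sketched in a few lines of Bourne shell. This is a sketch only: the daemon command, log path, and retry limit are placeholders, and it assumes the supervised program stays in the foreground rather than detaching itself, so that wait works.]

```shell
#!/bin/sh
# Minimal sketch of a "respawn" supervisor.  DAEMON and LOG are
# placeholders; a real supervisor would loop forever, and would need
# the daemon to run in the foreground so that wait(1) sees its death.
DAEMON=${DAEMON-"sleep 1"}	# stand-in for e.g. /usr/etc/in.named
LOG=${LOG-/tmp/respawn.log}
TRIES=0

: > $LOG			# start with an empty log
while [ $TRIES -lt 3 ]		# bounded here only for demonstration
do
	$DAEMON &		# run the daemon as our child
	PID=$!
	wait $PID		# blocks until the child exits
	echo "`date`: pid $PID exited, respawning" >> $LOG
	TRIES=`expr $TRIES + 1`
done
```

Since every death is logged before the respawn, the downtime record Dan asked about falls out for free.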
emv@math.lsa.umich.edu (Edward Vielmetti) (08/17/90)
In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:
>>So how do folks arrange to get automatic notification in a timely way
>>when their nameserver software dies? Answers for diverse hardware
>>running unix for me, but others may be interested in other cases.
>
>A quick hack would be to have a cron job which occasionally either
>checks for the existence of critical processes and starts 'em up, or
>just starts 'em and lets the processes fight over how many of which
>kind are to be running. Buuuut..

Ummmm....

I would think that the right way of managing these things would be to embed into them some piece of SNMP (the Simple Network Management Protocol), and then have them all watched over by a network management station which could get traps when the daemons die, arrange to have things restarted, etc etc. This would have the advantage of letting you watch over a bunch of systems from a single vantage point if you wanted to.

That said, I must confess that I don't know whether anyone has stuck SNMP into bind (anyone?), whether there's a MIB defined, or anything like that. I know that Sun doesn't ship it in SunOS 4.0.3, and that the DEC Ultrix 4.0 snmp/bind stuff appears to be amenable to such treatment but hasn't been done.

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
swatt@NOC.NET.YALE.EDU (Alan S. Watt) (08/17/90)
It seems obvious that the only meaningful way to monitor a nameserver is to toss requests at it and check the results. The Internet Rover package you can get free from Merit has code to do this. You can also use, or perhaps modify, dig (Domain Internet Groper) to the same end.

I have Rover running to monitor all the campus nameservers. Every few minutes or so it will hit them with a "1.0.0.127.in-addr.arpa" lookup request. This of course doesn't tell you everything is working properly, but if a nameserver can't manage this much, it's time to look at it. So far, I haven't seen any nameserver failures which can still pass this test.

- Alan S. Watt
  High Speed Networking, Yale University Computing and Information Systems
  Box 2112 Yale Station
  New Haven, CT 06520-2112
  (203) 432-6600 X394
  Watt-Alan@Yale.Edu

Disclaimer: "Make Love, Not War -- Be Prepared For Both"
	- Edelman's Sporting Goods [and Marital Aids?]
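[Editor's note: for sites without Rover, the same probe can be approximated with a little shell around nslookup. A sketch only: the server list and log path are placeholders for your own site, the notification step is left as a comment, and it assumes your nslookup's exit status reflects lookup failure, which varies between versions.]

```shell
#!/bin/sh
# Sketch of a Rover-style liveness probe: ask each listed server to
# resolve 1.0.0.127.in-addr.arpa and log any that fail to answer.
# SERVERS and LOG are placeholders; pipe the log to mail or syslog
# however your site prefers.
SERVERS=${SERVERS-""}		# e.g. "ns1.example.edu ns2.example.edu"
LOG=${LOG-/tmp/ns_check.log}
: > $LOG

probe()				# probe <server>: succeed iff it answers
{
	nslookup 1.0.0.127.in-addr.arpa "$1" > /dev/null 2>&1
}

for s in $SERVERS
do
	if probe "$s"
	then
		:		# answered; nothing to report
	else
		echo "`date`: $s failed reverse lookup" >> $LOG
		# mail -s "nameserver check" hostmaster < $LOG
	fi
done
```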
dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) (08/18/90)
>A quick hack would be to have a cron job which occasionally either
>checks for the existence of critical processes and starts 'em up, or
>just starts 'em and lets the processes fight over how many of which
>kind are to be running. Buuuut..

I think starting multiple nameds is a bad idea. The second one finds that the port is busy and hangs around doing nothing worthwhile.

>Ummmm....
>
>I would think that the right way of managing these things would be to
>embed into them some piece of SNMP (the Simple Network Management
>Protocol), and then have them all watched over by a network management
>station which could get traps when the daemons die, arrange to have
>things restarted, etc etc. This would have the advantage of letting
>you watch over a bunch of systems from a single vantage point if you
>wanted to.
>
>That said, I must confess that I don't know whether anyone has stuck
>SNMP into bind (anyone?), whether there's a MIB defined, or anything
>like that. I know that Sun doesn't ship it in SunOS 4.0.3, and that
>the DEC Ultrix 4.0 snmp/bind stuff appears to be amenable to such
>treatment but hasn't been done.
>
>--Ed
>
>Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>

Ed,

That is the kind of thing that I was hoping to find out about. Sending signals to named and using the returned status lacks something in elegance.

/dan
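[Editor's note: the signal trick Dan alludes to is usually `kill -0`, which delivers no signal but fails if the pid no longer names a live process. A minimal sketch; the pid-file paths here are mine, and it assumes the pid file is kept up to date by the daemon.]

```shell
#!/bin/sh
# kill -0 sends no signal at all; its exit status simply says whether
# the pid recorded in the file still names a process we may signal.
alive()				# alive <pidfile>
{
	[ -f "$1" ] && kill -0 `cat "$1"` 2>/dev/null
}

# Demonstration against this very shell: record our own pid and check.
echo $$ > /tmp/demo.pid
if alive /tmp/demo.pid
then
	echo "process alive"
else
	echo "process gone; restart it and log the downtime"
fi
```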
yenbut@CS.WASHINGTON.EDU (08/18/90)
Alan S. Watt writes:
> So far, I haven't seen any nameserver failures which can still
> pass this test.

We have seen a problem at least once when only one domain name is checked. We used to check only a name at the top of our authoritative zone, "CS.Washington.EDU", but one day "Washington.EDU", which we do zone transfers of from a campus nameserver, quit working. Now we check both zones once a day.

(We depend on nameservers backing up one another. Our resolver code sends a query to one nameserver at a time; it waits a few seconds for an answer from the first nameserver before sending the query to the second one. We know almost right away when the first nameserver on a machine we are using is down, from the noticeable delay.) We don't check "1.0.0.127.in-addr.arpa".

Voradesh Yenbut <yenbut@cs.washington.edu>
CSE Department, Univ of Washington, Seattle, WA
pushp@nic.cerf.net (Pushpendra Mohta) (08/18/90)
In article <EMV.90Aug17083504@urania.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
>Ummmm....
>I would think that the right way of managing these things would be to
>embed into them some piece of SNMP (the Simple Network Management Protocol),
>and then have them all watched over by a network management station
>which could get traps when the daemons die, arrange to have things
>restarted, etc etc. This would have the advantage of letting you watch
>over a bunch of systems from a single vantage point if you wanted to.
>--Ed

Good idea, however SNMP daemons have been known to die too :-) I have a cron process which checks whether my SNMPD is alive.

--pushpendra
CERFNet
boyd@necisa.ho.necisa.oz (Boyd Roberts) (08/20/90)
In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:
>
>This process will somehow take a list of processes to watch after.
>It will be the parent of all those processes, so that it will be notified
>of them dying .. It will have a number of actions it can do when
>the process dies, like wait awhile before starting a new copy, start
>one immediately, start one under some condition, etc.
>

It's been done already. Back in '83 or so Tim Long% at Sydney Uni Comp Sci rewrote init so it was far more flexible as a general purpose daemon controller. He had a file /etc/procs with entries like this:

	tty-console	/etc/login@ peb1200 /dev/console
	netd-basser40	/usr/spool/ACSnet/_lib/NNdaemon -I basser40
	skulk		/etc/skulk

The first field was a handle for the process, and the other fields were the program to run and its arguments. All daemons were started by init, and a naming convention was used so that a group of related processes could be controlled easily.

There was no concept of init `state', but you could interrogate init and ask it what was going on. To interrogate it you used a program called `toinit':

	toinit <command> <regular-expressions...>

The commands were (from what I can remember):

	start     - start it
	stop      - SIGTERM it and don't restart it
	kill      - SIGTERM then SIGKILL
	curtail   - don't restart it when it dies
	status    - tell me what the state of the world is
	scanprocs - re-read /etc/procs and incorporate any changes

The regular expressions were matched against the first /etc/procs field (the handle for the process) and the appropriate action was taken on any of the matches.

There were special entries in /etc/procs for a single-user shell on the console for boot & shutdown. Startup was just a script that had the appropriate mounts and then a large `toinit start ...'. Shutdown was just a `toinit stop tty-.* ...' and then some magic (I forget) to get a single-user shell on the console (these machines were 32V VAX 11/780's).
There were some bugs, but we fixed them and hacked in some more magic for auto-reboots. The `magic' was usually just an `rc'-like script that did the right things and then told init to start the appropriate stuff. With this approach you could control a _single_ entry, unlike the ghastly mess that is System V's /etc/inittab.

The IPC between `toinit' and `init' was a bit messy, but with a mounted process stream implementation (was this ever done, John?) it can be done really cleanly.

Boyd Roberts			boyd@necisa.ho.necisa.oz.au

``When the going gets weird, the weird turn pro...''

-------
% Bruce Ellis, Piers Lauder, John Mackin, Chris Maltby and myself added mods and bug fixes over the years.
@ getty/login were re-written into /etc/login. /bin/login was unlinked.
dpk@Morgan.COM (Doug Kingston) (08/27/90)
AIX has something like the program you mentioned wanting to write. Try finding some documentation for an RS6000 that describes it.

							-Doug-
stacy@sobeco.com (Stacy L. Millions) (08/31/90)
In <EMV.90Aug17083504@urania.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
>In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:
>> A quick hack would be to have a cron job which occasionally either
>> checks for the existence of critical processes and starts 'em up, or
>> just starts 'em and lets the processes fight over how many of which
>> kind are to be running. Buuuut..
>
>Ummmm....
>I would think that the right way of managing these things would be to
>embed into them some piece of SNMP (the Simple Network Management Protocol),
>and then have them all watched over by a network management station
>which could get traps when the daemons die, arrange to have things
>restarted, etc etc. This would have the advantage of letting you watch
>over a bunch of systems from a single vantage point if you wanted to.

The problem with this is, what happens when snmpd or one of its friends dies? It would be hard for snmpd to restart snmpd when snmpd is already dead :-). Seriously, you don't want to have to build support for snmp into all of your daemons; you want snmp to support your daemons, and if your snmp daemon dies, you're sunk.

I think I shall tack David's daemon manager onto my TODO list. Unfortunately, my TODO list has me booked into the next century (I plan to take off the years 1999-2001; I don't want to be anywhere near a computer when the century rolls :-).

-stacy
--
"Eat any good books lately?"			uunet!sobeco!stacy
	- 'Q' _Star Trek, The Next Generation_	stacy@sobeco.com
david@twg.com (David S. Herron) (09/01/90)
In article <1990Aug30.182942.21342@sobeco.com> stacy@sobeco.com (Stacy L. Millions) writes:
>The problem with this is, what happens when snmpd or one of its
>friends dies? It would be hard for snmpd to restart snmpd when
>snmpd is already dead :-). Seriously, you don't want to have to
>build support for snmp into all of your daemons; you want snmp
>to support your daemons, and if your snmp daemon dies, you're sunk.

Yes, exactly.. my idea was to have a very simple generic daemon for monitoring other daemons. Hopefully this would also translate into small size, which ought to make it easier to debug and to make sure that *IT* was a "safe" program unlikely to crash, etc.

The System V init with /etc/inittab is close, but it doesn't do some of the things I want. For instance, if a process normally exits occasionally, it might be good to check its exit status to see if it's a "real" problem or not, and act differently. In any case there should be logging to some place, be it e-mail or syslog or whatever. I'm reasonably sure that the SysV init doesn't log anywhere .. after all, it's for running getty's, and why would you want to know when they exit?

Oh.. and obviously you have to rewrite all those daemons so they don't do anti-social things like detach from the controlling process.

>I think I shall tack David's daemon manager onto my TODO list.
>Unfortunately, my TODO list has me booked into the next century
>(I plan to take off the years 1999-2001; I don't want to be
>anywhere near a computer when the century rolls :-).

My TODO list is just as long, believe me ...

I did hear of two things. One, in SysVr4 (haven't bothered to look this up in the documentation yet..), is the Service Access Facility. The other, from either Cornell or CMU (I forget, and I've deleted that piece of mail), is a program called "nanny". *Great* name, at least!
--
<- David Herron, an MMDF & WIN/MHS guy, <david@twg.com>
<- Formerly: David Herron -- NonResident E-Mail Hack <david@ms.uky.edu>
<-
<- Sign me up for one "I survived Jaka's Story" T-shirt!
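[Editor's note: the exit-status distinction David describes can be sketched in a few lines of shell; the policy of what counts as a "real" problem is, of course, a placeholder.]

```shell
#!/bin/sh
# Sketch of restarting based on exit status: run the child, wait for
# it, and let the status choose between "fine" and "log and respawn".
sh -c 'exit 3' &	# stand-in for a daemon that died abnormally
PID=$!
wait $PID		# wait reports the child's exit status in $?
STATUS=$?

if [ $STATUS -eq 0 ]
then
	echo "clean exit; nothing to do"
else
	echo "exit status $STATUS: log it, maybe back off, then restart"
fi
```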
drake@drake.almaden.ibm.com (09/05/90)
There's also the System Resource Controller in AIX Version 3 for the RISC System/6000. From the documentation, here's a short description:

  The System Resource Controller (SRC) is a subsystem controller designed to minimize the need for operator intervention. It provides a mechanism to control subsystem processes using a common command line and the C interface. This includes the following:

  * Notification provided upon abnormal termination of related processes
  * Tracing of a subsystem, a group of subsystems, or a subserver
  * Consistent user interface for start, stop, and status inquiries
  * Support for control of operations on a remote system
  * Logging of abnormal termination of subsystems.

The System Resource Controller is useful if you want a common way to start, stop, and collect status on processes.

Sam Drake / IBM Almaden Research Center
Internet: drake@ibm.com  BITNET: DRAKE at ALMADEN
Usenet: ...!uunet!ibmarc!drake  Phone: (408) 927-1861