[comp.protocols.tcp-ip.domains] Monitoring your nameserver

dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) (08/14/90)

A few days ago I posted a question to the BIND list.  I got a few responses
but not as many as I expected.  This mailing list is a little more active
than the BIND list so maybe I can get a few more ideas from it.

/dan

To: bind@ucbarpa.Berkeley.EDU
Subject: How do YOU tell if named has died?

Two events yesterday prompt me to ask the assembled multitude the
question in the subject line.  First, I discovered that while I was
away on vacation named on one of our hosts died and nobody here at the
time noticed its passing.  It didn't cause problems for anyone because
it is only doing DNS for our local domain and is one of two
secondaries.  The primary and the off-site secondary were both doing
their job.  But we might not have been so lucky.

Second, I read J. Van Bokkelen's new RFC (rfc1173) and was reminded of my
responsibilities (which I obviously had been shirking).

So how do folks arrange to get automatic notification in a timely way
when their nameserver software dies?  Answers for diverse hardware
running unix for me, but others may be interested in other cases.

/dan

mlindsey@x102c.harris-atd.com (Lindsey MS 04396) (08/15/90)

In article <9008141525.AA27754@sci.ccny.cuny.edu> dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) writes:
>So how do folks arrange to get automatic notification in a timely way
>when their nameserver software dies?  Answers for diverse hardware
>running unix for me, but others may be interested in other cases.

We have a shell script that cron runs every minute to check named, log errors,
and restart named if applicable.  It's not very sexy, but it works ok.

Since the shell script is so short, I've included it below:
###################  cut here ####################################
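#!/bin/sh
#  Program:     check_named
#  Purpose:     to ensure named is always running.
#  Code date:   3/2/90
#
#  Revision History:
#
#  Date         Name            Description of change
#
# Remove any core file left behind by an earlier crash.
if [ -f /etc/named/core ]
  then
        rm /etc/named/core
fi
if [ -f /etc/named.pid ]
  then
        # named records its process id here; see if that process is alive.
        PID=`cat /etc/named.pid`
        ANSWER=`ps $PID | grep -c in.named`
        if [ "$ANSWER" = 0 ]
          then
                # Not running: restart it and log when we noticed.
                /usr/etc/in.named
                date >> /etc/named/named_downtime
        fi
  else
        # No pid file at all: assume named is not running.
        /usr/etc/in.named
        date >> /etc/named/named_downtime
fi
exit 0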
###################  cut here ####################################
"Waste your brain, wax your board, and pray for waves!"   Woody in E.G.A.E.
/earth is 98% full!  Please delete anyone you can!	 (anonymous)
$teve Lindsey		|-)	uunet!x102a!mlindsey
(407) 727-5893		:-)	mlindsey@x102a.ess.harris.com

david@twg.com (David S. Herron) (08/17/90)

In article <9008141525.AA27754@sci.ccny.cuny.edu> dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) writes:
>Subject: How do YOU tell if named has died?

>So how do folks arrange to get automatic notification in a timely way
>when their nameserver software dies?  Answers for diverse hardware
>running unix for me, but others may be interested in other cases.

A quick hack would be to have a cron job run on occasion which either
checks for the existence of critical processes & starts 'em up, or
just starts 'em & lets the processes fight over how many of which kind
are to be running.  Buuuut..

There's a generic problem with the way daemons are done in Unix
that goes beyond `name service'.  That is that the daemons
are processes spun off into the background and not watched after.
[So therefore I'm cross-posting to unix-wizards..]  If `cron' dies
the system is just as crippled, though in a different way.  And
random people are just as likely to notice cron dying as they
are to notice named dying now.

Something on my long and varied list of Things To Do (but haven't done
yet) is to write a program (name: respawn, or daemond) which watches
after generic processes.  As opposed to init which is suited to
watching after /etc/getty's.

This process will somehow take a list of processes to watch after.
It will be the parent of all those processes, so that it will be notified
of them dying ..  It will have a number of actions it can do when
the process dies, like wait awhile before starting a new copy, start
one immediately, start one under some condition, etc.
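
A minimal sketch of such a watcher in POSIX sh.  The function name
`respawn', its argument order, and the cap on restarts are all
hypothetical; it also assumes the watched program stays in the
foreground, so that the shell, as its parent, sees it exit.

```sh
#!/bin/sh
# respawn MAX DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the foreground and, each time it
# exits, logs the exit status, waits DELAY seconds, and starts it again.
# Gives up after MAX runs so a broken daemon cannot loop forever.
respawn() {
    max=$1
    delay=$2
    shift 2
    runs=0
    while [ "$runs" -lt "$max" ]; do
        "$@"                      # blocks until the child exits
        status=$?
        runs=$((runs + 1))
        echo "respawn: '$1' exited with status $status (run $runs of $max)"
        sleep "$delay"
    done
    return 1                      # restart budget exhausted
}
```

Called as `respawn 10 60 some_daemon', where some_daemon is a
placeholder, this would allow ten runs with a minute's pause after each
exit.  A daemon that forks and detaches returns at once and would burn
through the budget, so this only suits programs that can be told to
stay attached.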

This is different from init in that init is rather specific to
watching after getty's.  Even the SysV version of init .. though
the configurability of /etc/inittab gets close to what I have in mind.

This is different from inetd in that inetd is specific to network
services.  `cron' is not a network service, yet it also needs to
be watched over in this way.  Also inetd is suited to a situation
where it starts up a fresh process for each connection -- in the
particular case of named this is bad because named needs to be
running all the time.

At the moment we're relying on the hopeful assumption of a lack of
bugs in these background daemons.  (Where's some wood to knock on??)
-- 
<- David Herron, an MMDF weenie, <david@twg.com>
<- Formerly: David Herron -- NonResident E-Mail Hack <david@ms.uky.edu>
<-
<- Sign me up for one "I survived Jaka's Story" T-shirt!

emv@math.lsa.umich.edu (Edward Vielmetti) (08/17/90)

In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:

   >So how do folks arrange to get automatic notification in a timely way
   >when their nameserver software dies?  Answers for diverse hardware
   >running unix for me, but others may be interested in other cases.

   A quick hack would be to have a cron job run on occasion which either
   checks for the existence of critical processes & starts 'em up, or
   just starts 'em & lets the processes fight over how many of which kind
   are to be running.  Buuuut..

Ummmm....

I would think that the right way of managing these things would be to
embed into them some piece of SNMP (the Simple Network Management Protocol),
and then have them all watched over by a network management station
which could get traps when the daemons die, arrange to have things
restarted, etc etc.  This would have the advantage of letting you watch
over a bunch of systems from a single vantage point if you wanted to.

That said, I must confess that I don't know whether anyone has stuck
SNMP into bind (anyone?), whether there's a MIB defined, or anything
like that.  I know that Sun doesn't ship it in SunOS 4.0.3, and that
the DEC Ultrix 4.0 snmp/bind stuff appears to be amenable to such
treatment but hasn't been done.

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>

swatt@NOC.NET.YALE.EDU (Alan S. Watt) (08/17/90)

It seems obvious the only meaningful way to monitor a nameserver is
toss requests at it and check the results.  The Internet Rover package
you can get free from Merit has code to do this.  You can also
use or perhaps modify dig (Domain Internet Groper) to the same end.

I have Rover running to monitor all the campus nameservers.  Every
few minutes or so it will hit them with a "1.0.0.127.in-addr.arpa"
lookup request.  This of course doesn't tell you everything is working
properly, but if a nameserver can't manage this much, it's time to
look at it.

So far, I haven't seen any nameserver failures which can still
pass this test.
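
A rough sketch of this kind of probe using dig rather than Rover.  It
assumes a BIND 9 era dig that accepts -x, +short, +time and +tries;
the function name `probe_ns' and the three-second timeout are
hypothetical.

```sh
#!/bin/sh
# probe_ns SERVER
# Hypothetical helper: asks SERVER for the PTR record of 127.0.0.1
# (that is, the name 1.0.0.127.in-addr.arpa) and succeeds only if a
# real answer comes back.  Assumes a dig that supports these options.
probe_ns() {
    reply=$(dig "@$1" -x 127.0.0.1 +short +time=3 +tries=1 2>/dev/null) \
        || return 1
    # dig prints ";;" diagnostics on stdout; drop them, and succeed only
    # if a non-empty answer line is left.
    printf '%s\n' "$reply" | grep -v '^;' | grep -q .
}
```

From cron this could be used as
`probe_ns ns1.example.edu || echo "ns1 is not answering"', with the
hostname being a placeholder.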

	- Alan S. Watt
	  High Speed Networking, Yale University
	  Computing and Information Systems
	  Box 2112 Yale Station
	  New Haven, CT  06520-2112
	  (203) 432-6600 X394
	  Watt-Alan@Yale.Edu


Disclaimer:  "Make Love, Not War -- Be Prepared For Both"
		- Edelman's Sporting Goods [and Marital Aids?]

dan@SCI.CCNY.CUNY.EDU (Dan Schlitt) (08/18/90)

	
	   A quick hack would be to have a cron job run on occasion which either
	   checks for the existence of critical processes & starts 'em up, or
	   just starts 'em & lets the processes fight over how many of which kind
	   are to be running.  Buuuut..
	
I think starting multiple nameds is a bad idea.  The second one finds
that the port is busy and hangs around doing nothing worthwhile.

	Ummmm....
	
	I would think that the right way of managing these things would be to
	embed into them some piece of SNMP (the Simple Network Management Protocol),
	and then have them all watched over by a network management station
	which could get traps when the daemons die, arrange to have things
	restarted, etc etc.  This would have the advantage of letting you watch
	over a bunch of systems from a single vantage point if you wanted to.
	
	That said, I must confess that I don't know whether anyone has stuck
	SNMP into bind (anyone?), whether there's a MIB defined, or anything
	like that.  I know that Sun doesn't ship it in SunOS 4.0.3, and that
	the DEC Ultrix 4.0 snmp/bind stuff appears to be amenable to such
	treatment but hasn't been done.
	
	--Ed
	
Ed,

That is the kind of thing that I was hoping to find out about.
Sending signals to named and using the returned status lacks
something in elegance.
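
One reading of "sending signals" is the null signal: kill -0 delivers
nothing, but its exit status says whether the process exists.  A
minimal sketch, assuming named writes its process id to a pid file;
the function name `pid_alive' is hypothetical.

```sh
#!/bin/sh
# pid_alive PIDFILE
# Hypothetical helper: succeeds if PIDFILE names a process that exists
# and that we are allowed to signal.  Signal 0 is never delivered; kill
# only performs the existence and permission checks.
pid_alive() {
    [ -r "$1" ] || return 1       # no readable pid file
    pid=$(cat "$1")
    [ -n "$pid" ] || return 1     # empty pid file
    kill -0 "$pid" 2>/dev/null
}
```

This has to run as root or as the user that owns the daemon, since a
permission failure looks the same as a dead process.  It also only
shows that some process has that id, not that it is answering queries.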

/dan
	Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>

yenbut@CS.WASHINGTON.EDU (08/18/90)

Alan S. Watt writes:
> So far, I haven't seen any nameserver failures which can still
> pass this test.

We have seen a problem at least once when only one domain name is
checked.  We used to check only a domain name at the top of our
authoritative zone, "CS.Washington.EDU", but one day
"Washington.EDU", which we zone-transfer from a campus
nameserver, quit working for some reason.

Now we check both zones once a day.  (We depend on nameservers
backing up one another.  Our resolver code sends a query to one
nameserver at a time, waiting a few seconds for an answer from
the first nameserver before sending the query to the second one.
We know almost right away when the first nameserver on a machine
we are using is down, by the feeling of delay.)  We don't check on
"1.0.0.127.in-addr.arpa".
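
A sketch of a per-zone check along these lines, asking one server for
the SOA record of each zone it is supposed to carry.  It assumes a
BIND 9 era dig (+short, +norecurse, +time, +tries); the function name
`check_zones' is hypothetical.

```sh
#!/bin/sh
# check_zones SERVER ZONE...
# Hypothetical helper: prints every ZONE for which SERVER gives no SOA
# answer from its own data, and returns nonzero if any zone failed.
check_zones() {
    server=$1
    shift
    failed=0
    for zone in "$@"; do
        # +norecurse: only accept what the server itself holds.
        reply=$(dig "@$server" "$zone" SOA +short +norecurse \
                    +time=3 +tries=1 2>/dev/null) || reply=
        # Ignore dig's ";;" diagnostics; an empty result means no SOA.
        if ! printf '%s\n' "$reply" | grep -v '^;' | grep -q .; then
            echo "$zone"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_zones 192.0.2.53 cs.washington.edu washington.edu'
would print only the zones that failed; the address is a placeholder.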

   Voradesh Yenbut	<yenbut@cs.washington.edu>
   CSE Department, Univ of Washington, Seattle, WA

pushp@nic.cerf.net (Pushpendra Mohta) (08/18/90)

In article <EMV.90Aug17083504@urania.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
>Ummmm....
>I would think that the right way of managing these things would be to
>embed into them some piece of SNMP (the Simple Network Management Protocol),
>and then have them all watched over by a network management station
>which could get traps when the daemons die, arrange to have things
>restarted, etc etc.  This would have the advantage of letting you watch
>--Ed

Good idea, however SNMP daemons have been known to die too :-)
I have a cron process which checks whether my SNMPD is alive.

--pushpendra
CERFNet

boyd@necisa.ho.necisa.oz (Boyd Roberts) (08/20/90)

In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:
>
>This process will somehow take a list of processes to watch after.
>It will be the parent of all those processes, so that it will be notified
>of them dying ..  It will have a number of actions it can do when
>the process dies, like wait awhile before starting a new copy, start
>one immediately, start one under some condition, etc.
>

It's been done already.  Back in '83 or so Tim Long% at Sydney Uni
Comp Sci rewrote init so it was far more flexible as a general
purpose daemon controller.

He had a file /etc/procs with entries like this:

tty-console	/etc/login@ peb1200 /dev/console
netd-basser40	/usr/spool/ACSnet/_lib/NNdaemon -I basser40
skulk		/etc/skulk

The first field was a handle for the process and the other fields
were the program to run and its arguments.  All daemons were started
by init and a naming convention was used so that a group of related
processes could be controlled easily.

There was no concept of init `state'.  But you could interrogate
init and ask it what was going on.  To interrogate it you used
a program called `toinit':

    toinit <command> <regular-expressions...>

The commands were (from what I can remember):

    start	- start it
    stop	- SIGTERM it and don't restart it
    kill	- SIGTERM then SIGKILL
    curtail	- don't restart it when it dies
    status	- tell me what the state of world is
    scanprocs	- re-read /etc/procs and incorporate any changes

The regular expressions were matched against the first /etc/procs field
(the handle for the process) and the appropriate action was taken
on any of the matches.
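
The matching step can be sketched in sh and awk.  This is not Tim
Long's code; the function name `match_procs' is hypothetical, and it
assumes the handle is the first whitespace-separated field and that a
regular expression must match the whole handle.

```sh
#!/bin/sh
# match_procs PROCS_FILE REGEX...
# Hypothetical helper: prints each line of PROCS_FILE whose first field
# (the handle) is matched in full by at least one REGEX.
match_procs() {
    file=$1
    shift
    pattern=
    for re in "$@"; do
        # Join the expressions into one alternation: (re1)|(re2)|...
        pattern="${pattern:+$pattern|}($re)"
    done
    # Anchor at both ends so "tty" alone does not match "tty-console".
    awk -v re="^($pattern)\$" '$1 ~ re' "$file"
}
```

With the three sample entries above in a file, `match_procs file
'tty-.*'' would print only the tty-console line, and a wrapper could
then start or stop each command it gets back.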

There were special entries in /etc/procs for a single user shell on the
console for boot & shutdown.  Startup was just a script that had the
appropriate mounts and then a large `toinit start ...'.  Shutdown
was just a `toinit stop tty-.* ...' and then some magic (I forget)
to get a single user shell on the console (these machines were 32V
VAX 11/780's).

There were some bugs, but we fixed them and hacked in some more
magic for auto-reboots.  The `magic' was usually just a `rc' like
script that did the right things and then told init to start
the appropriate stuff.

With this approach you could control a _single_ entry, unlike the
ghastly mess that is System V's /etc/inittab.  The IPC between
`toinit' and `init' was a bit messy, but with a mounted process
stream implementation (was this ever done John?) it can be
done really cleanly.

Boyd Roberts			boyd@necisa.ho.necisa.oz.au

``When the going gets weird, the weird turn pro...''

-------
% Bruce Ellis, Piers Lauder, John Mackin, Chris Maltby and myself
  added mods and bug fixes over the years.

@ getty/login were re-written into /etc/login.  /bin/login was unlinked.

dpk@Morgan.COM (Doug Kingston) (08/27/90)

AIX has something like you mentioned wanting to write.  Try finding
some documentation for an RS6000 that describes it.

-Doug-

stacy@sobeco.com (Stacy L. Millions) (08/31/90)

In <EMV.90Aug17083504@urania.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:

>In article <7769@gollum.twg.com> david@twg.com (David S. Herron) writes:
>   A quick hack would be to have a cron job run on occasion which either
>   checks for the existence of critical processes & starts 'em up, or
>   just starts 'em & lets the processes fight over how many of which kind
>   are to be running.  Buuuut..
>Ummmm....
>I would think that the right way of managing these things would be to
>embed into them some piece of SNMP (the Simple Network Management Protocol),
>and then have them all watched over by a network management station
>which could get traps when the daemons die, arrange to have things
>restarted, etc etc.  This would have the advantage of letting you watch
>over a bunch of systems from a single vantage point if you wanted to.

The problem with this is, what happens when snmpd or one of its
friends dies? It would be hard for snmpd to restart snmpd when
snmpd is already dead :-). Seriously, you don't want to have to
build support for snmp into all of your daemons, you want snmp
to support your daemons, and if your snmp daemon dies - you're sunk.

I think I shall tack David's daemon manager onto my TODO list.
Unfortunately, my TODO list has me booked into the next century
(I plan to take off the years 1999-2001, I don't want to be
anywhere near a computer when the century rolls :-).

-stacy

-- 
"Eat any good books lately?"                                 uunet!sobeco!stacy
    - 'Q' _Star Trek, The Next Generation_                     stacy@sobeco.com

david@twg.com (David S. Herron) (09/01/90)

In article <1990Aug30.182942.21342@sobeco.com> stacy@sobeco.com (Stacy L. Millions) writes:
>The problem with this is, what happens when snmpd or one of its
>friends dies? It would be hard for snmpd to restart snmpd when
>snmpd is already dead :-). Seriously, you don't want to have to
>build support for snmp into all of your daemons, you want snmp
>to support your daemons, and if your snmp daemon dies - you're sunk.

Yes, exactly..

My idea was to have a very simple generic daemon for monitoring other
daemons.  Hopefully this would also translate into small size which
ought to make it easier to debug and make sure that *IT* was a "safe"
program unlikely to crash, etc.

The System V init with /etc/inittab is close.  It doesn't do some of
the things I want.  For instance if a process normally exits occasionally
it might be good to check its exit status to see if it's a "real" problem
or not, and act differently.  In any case there should be logging to
some place, be it e-mail or syslog or whatever.  I'm reasonably sure that
the SysV init doesn't log anywhere .. after all, it's for running getty's
and why would you want to know when they exit?

Oh.. and obviously you have to rewrite all those daemons so they don't
do anti-social things like detach from the controlling process.

>I think I shall tack David's daemon manager onto my TODO list.
>Unfortunately, my TODO list has me booked into the next century
>(I plan to take off the years 1999-2001, I don't want to be
>anywhere near a computer when the century rolls :-).

My TODO list is just as long, believe me ...

I did hear of two things.  One, in SysVr4 (haven't bothered to look this
up in the documentation yet..), is the Service Access Facility.

The other, from either Cornell or CMU (I forget, and I've deleted
that piece of mail), is a program called "nanny".  *great* name at least!
-- 
<- David Herron, an MMDF & WIN/MHS guy, <david@twg.com>
<- Formerly: David Herron -- NonResident E-Mail Hack <david@ms.uky.edu>
<-
<- Sign me up for one "I survived Jaka's Story" T-shirt!

drake@drake.almaden.ibm.com (09/05/90)

There's also the System Resource Controller in AIX Version 3 for the
RISC System/6000.  From the documentation, here's a short description:

The System Resource Controller (SRC) is a subsystem controller designed to 
minimize the need for operator intervention.  It provides a mechanism to 
control subsystem processes using a common command line and the C interface.  
This includes the following:

* Notification provided upon abnormal system termination of related processes

* Tracing of a subsystem, a group of subsystems, or a subserver

* Consistent user interface for start, stop, status inquiries

* Support for control of operations on a remote system

* Logging abnormal termination of subsystems.

The System Resource Controller is useful if you want a common way to start, 
stop, and collect status on processes.


Sam Drake / IBM Almaden Research Center 
Internet:  drake@ibm.com            BITNET:  DRAKE at ALMADEN
Usenet:    ...!uunet!ibmarc!drake   Phone:   (408) 927-1861