[comp.sys.sgi] Runaway named process

srp@babar.mmwb.ucsf.edu (Scott R. Presnell) (11/16/90)

Hi folks.

	I've got a curious problem with the name service that I can't seem
to pin down. So I thought I'd ask to see if anyone else is having this
problem.

	We run named out of the box on IRIX 3.3.1; our resolver is set to
look to the /etc/hosts file first, then named. We upgraded to 3.3.1 about 6
weeks ago.

Over the last week, I've had several cases of the named process "running
away:" gaining inordinate amounts of CPU time (in the thousands of minutes
as opposed to the normal one or two minutes), and essentially becoming
useless for resolving remote hosts (ps and top show it to be in the "run"
state constantly).  None of the configuration files have changed recently.
After I kill and restart named, there is no problem, at least for a
while.  This has been happening on two different machines now (4D/2[05]G).

	I don't know if this is connected, but I've also noted a lot of
resolution failures with MAXQUERIES exceeded recently:

Nov 11 15:50:10 babar named[98]: MAXQUERIES exceeded, possible data loop in
	resolving (2.246.70.192.in-addr.arpa)
Nov 11 15:50:16 babar named[98]: MAXQUERIES exceeded, possible data loop in
	resolving (130.185.65.192.in-addr.arpa)
Nov 12 20:43:19 babar named[98]: MAXQUERIES exceeded, possible data loop in
	resolving (ifi.ethz.ch)
Nov 14 23:32:03 babar named[98]: MAXQUERIES exceeded, possible data loop in
	resolving (15.3.7.129.in-addr.arpa)
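
For anyone wanting to tally these from their own syslog, here is a minimal
sketch in Python. It assumes the entries look like the ones above, including
the line wrap before "resolving"; the function name maxqueries_names is made
up for illustration and is not part of named or syslog.

```python
import re
from collections import Counter

# One syslog entry from named; re.S lets ".*?" cross the line wrap
# that precedes "resolving (...)" in the entries shown above.
MAXQ = re.compile(
    r"named\[\d+\]: MAXQUERIES exceeded.*?resolving\s+\(([^)]+)\)",
    re.S,
)

def maxqueries_names(log_text):
    """Count how often each name appears in a MAXQUERIES message.

    Hypothetical helper: it only understands the message layout
    quoted in this post.
    """
    return Counter(m.group(1).lower() for m in MAXQ.finditer(log_text))
```

Feeding it the contents of the syslog file and printing the result of
most_common() shows which names named keeps giving up on.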

	Anyone seen this sort of stuff?  Any clues?

	Is named in an infinite loop?

	Thanks for your help.

	- Scott Presnell




--
Scott Presnell				        +1 (415) 476-9890
Pharm. Chem., S-926				Internet: srp@cgl.ucsf.edu
University of California			UUCP: ...ucbvax!ucsfcgl!srp
San Francisco, CA. 94143-0446			Bitnet: srp@ucsfcgl.bitnet

karron@KARRON.MED.NYU.EDU (11/16/90)

I just retested my resolv.conf setup, and again, nslookup reports
the failure of my nameserver. If it were using /etc/hosts, it would get
an answer back.

Here is my resolv.conf:

domain          med.nyu.edu
hostresorder    local bind
nameserver      0.0.0.0
nameserver      128.122.135.4   #med.nyu.edu
nameserver      128.122.128.2   #nyu.edu

Here are the results with the above resolv.conf:

karron:~:102nslookup
Default Server:  karron
Address:  0.0.0.0

> ls karron
*** Can't list domain karron: No response from server
> ls med.nyu.edu
*** Can't list domain med.nyu.edu: No response from server
> exit
karron:~:103

Here are the results if I comment out the line nameserver 0.0.0.0:

karron:~:101nslookup
Default Server:  mcclb0.med.nyu.edu
Address:  128.122.135.4

> ls med.nyu.edu
[mcclb0.med.nyu.edu]
 med.nyu.edu                    server = cmcl2.nyu.edu
 med.nyu.edu                    server = acf5.nyu.edu
 med.nyu.edu                    server = egress.nyu.edu
 med.nyu.edu                    server = mcclb0.med.nyu.edu
 localhost                      127.0.0.1
 mcclb0                         128.122.135.4
 free-135-1                     128.122.135.1
 mcmnc1                         128.122.135.2
 karron                         128.122.135.3
 mcmrm47                        128.122.139.47


.lots of stuff deleted...
 mcmrm48                        128.122.139.48
> exit
karron:~:102

It is the above behavior that leads me to believe that /etc/hosts
is not queried, and that a local named (BIND) is required to get
service from a local resolver.
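
To spell out what I expected "hostresorder local bind" to do, here is a
rough sketch in Python of the intended lookup order: /etc/hosts first, then
the name server. This only models the order; it is not how the IRIX
resolver is implemented, and resolve, hosts and dns_lookup are hypothetical
names used for illustration.

```python
def resolve(name, order, hosts, dns_lookup):
    """Try each source in `order` and return (address, source).

    hosts      -- dict standing in for /etc/hosts (assumption)
    dns_lookup -- callable standing in for a query to named;
                  returns an address string or None (assumption)
    """
    for source in order:
        if source == "local":
            addr = hosts.get(name)
        elif source == "bind":
            addr = dns_lookup(name)
        else:
            continue  # unknown keyword: skip it
        if addr is not None:
            return addr, source
    return None, None
```

With a name server that never answers, a name present in the hosts table
should still come back from the "local" step under this model.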

+-----------------------------------------------------------------------------+
| karron@nyu.edu (mail alias that will always find me)                        |
|                                         Dan Karron                          |
| . . . . . . . . . . . . . .             New York University Medical Center  |
| 560 First Avenue           \ \    Pager <1> (212) 397 9330                  |
| New York, New York 10016    \**\        <2> 10896   <3> <your-number-here>  |
| (212) 340 5210               \**\__________________________________________ |
| Please Note : Soon to move to dan@karron.med.nyu.edu 128.122.135.3  (Nov 1 )|
+-----------------------------------------------------------------------------+

srp@babar.mmwb.ucsf.edu (Scott R. Presnell) (11/20/90)

srp@babar.mmwb.ucsf.edu (I) write:

>Hi folks.

>	I've got a curious problem with the name service that I can't seem
>to pin down. So I thought I'd ask to see if anyone else is having this
>problem.
>Over the last week, I've had several cases of the named process "running
>away:" gaining inordinate amounts of CPU time (in the thousands of minutes

Just in case someone else runs into this, I'll answer my own question.
It turns out that I got hit by the bogus root nameservers that are making the
rounds. If you see entries like these in named's named_dump.db, you've been
hit too.

; Dumped at Fri Nov 16 08:58:43 1990
; --- Cache & Data ---
$ORIGIN .
.	602116	IN	NS	NS.NIC.DDN.MIL.

[...]
	18376	IN	NS	TELECOM.	; bad - does not exist
	18352	IN	NS	NEXTSVR.	; bad
	18352	IN	NS	MTECV1.		; bad
;
;
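
If you'd rather not eyeball the dump, a short Python sketch along these
lines can flag candidates. It rests on one assumption of mine: an NS record
whose target is a single label with a trailing dot (like TELECOM. above) is
suspect. That is a heuristic, not a rule from BIND, and the name
suspect_ns_targets is made up for illustration.

```python
def suspect_ns_targets(dump_text):
    """Return NS targets in a named dump that are a single label.

    Heuristic (assumption): real name servers have multi-label
    names such as NS.NIC.DDN.MIL., so "TELECOM." looks bogus.
    """
    suspects = []
    for line in dump_text.splitlines():
        # Drop the ";" comment, then split into whitespace fields.
        fields = line.split(";", 1)[0].split()
        for i in range(len(fields) - 2):
            if fields[i] == "IN" and fields[i + 1] == "NS":
                target = fields[i + 2]
                # Exactly one dot, at the end, and not the root itself.
                if target != "." and target.endswith(".") and target.count(".") == 1:
                    suspects.append(target)
                break
    return suspects
```

Run over the excerpt above, it should pick out the three bad entries and
leave NS.NIC.DDN.MIL. alone.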

The affected hosts were secondary servers that forwarded requests.  My
fix was twofold:

	1) Don't run named as a forwarder to a specific host (in my case,
	that is what caused the cache to become contaminated).

	2) You may also want to get BIND 4.8.3 from ucbarpa.Berkeley.EDU
	(ha, ha, they lost!) and install the named part.  It takes no
	effort to get it up on the SGI, and because you have the source,
	you can insert code to warn you of cache changes and zone updates.
	It's also a more recent version than the one SGI ships.
	
	I'd be glad to help if anyone else bumps into this problem.

	- Scott Presnell