[comp.protocols.tcp-ip.domains] BOGUS ROOT SERVERS!!

woods@ncar.ucar.edu (Greg Woods) (11/14/90)

(This first started last Wednesday and has continued through this morning)

We are having a serious problem with our name servers that APPEARS to be
related to bogus root server data that is coming in from God knows where.
Our configuration is that we have a primary server (ncar.ucar.edu a.k.a.
handies.ucar.edu) which is the known server for our domain and is queried
from the outside. We also have a server unknown to the outside which is
configured as a secondary and is used as the forwarder by most of our
internal machines (some internal machines still forward to the primary
due to inertia; I don't control every server here so some of them are slow
to change over as I have asked). (In case it matters, both are BIND 4.8.2,
the primary is a Sun 4/280 running Sun OS 4.0.3, and the secondary is
a Microvax II (a.k.a. boat anchor :-) running Ultrix 3.1)

What happens is that when a machine is rebooted (or named is restarted), it
goes into an infinite loop burning tons of CPU time and refusing to answer
queries. It also ignores all signals (except 9, of course) which makes
debugging a real pain. Empirical evidence shows that every time this has
happened, I find the following bogus root servers in both the primary and
secondary servers' caches:

(root)  nameserver = MTECV1
(root)  nameserver = TELECOM
(root)  nameserver = NEXTSVR

These appear with no domains and with no corresponding A record which I
suspect may be the root of the problem (pun not intended, I swear). If
this junk is NOT in the cache, then name servers using one of these as
the forwarder can be started fine.  If this junk *is* present, then
killing and restarting first the primary and then the secondary (which
of course removes the junk) will allow other servers here to be
restarted. Occasionally I also see "lbl.gov" show up as a root server,
but if it is there without these other three, it does not seem to cause
the problem to occur. It occurs to me that the probable reason for that
is that lbl.gov is a legitimate name that can be looked up and an A record
eventually found, even if it isn't really a root server.

Has anyone else seen this? Does anyone have any idea what the &^$%#@! is going
on? I am familiar with how the DNS works on an administrative and conceptual
level, but I am not familiar with BIND on a source code level, nor does
the rather cryptic output you get when you turn debugging on make a whole
lot of sense to me (the latter is a consequence of the former, I expect).
Before I dive into the source code, I'd like to ask: is there any reason why
data about the root domain coming in from outside should EVER be believed
and cached?  Has anyone patched BIND to disallow this? Will I break the entire
DNS if I do this here? :-)

--Greg

rickert@mp.cs.niu.edu (Neil Rickert) (11/14/90)

In article <9163@ncar.ucar.edu> woods@ncar.UCAR.EDU (Greg Woods) writes:
>We are having a serious problem with our name servers that APPEARS to be
>related to bogus root server data that is coming in from God knows where.
>Our configuration is that we have a primary server (ncar.ucar.edu a.k.a.
>(....)
>
>What happens is that when a machine is rebooted (or named is restarted), it
>goes into an infinite loop burning tons of CPU time and refusing to answer
>(....)
>queries. It also ignores all signals (except 9, of course) which makes
>debugging a real pain. Empirical evidence shows that every time this has
>happened, I find the following bogus root servers in both the primary and
>secondary servers' caches:
>
>Has anyone else seen this? Does anyone have any idea what the &^$%#@! is going

  Probably lots of people have seen the bogus records.  They didn't cause
any loops on our system, but we did have to kill and restart named to remove
them from the cache.  (Come to think of it, maybe it is about time I rechecked
the cache to see if they have reappeared).

>Before I dive into the source code, I'd like to ask: is there any reason why
>data about the root domain coming in from outside should EVER be believed
>and cached?  Has anyone patched BIND to disallow this? Will I break the entire
>DNS if I do this here? :-)
>
  If the data comes from one of the root servers, it should be believed and
cached.  (Alas, at one stage a root server was putting out these bogus
records - it had been contaminated too).
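
A minimal sketch of that rule, in Python rather than BIND's own C: NS records
for the root are kept only when the reply came from a configured root server.
The function name, the record tuples and the addresses are all hypothetical,
and this is not how BIND 4.8.2 behaves. As noted above, it would not have
helped while a real root server was itself handing out the bad records.

```python
def filter_records(records, source_addr, root_hints):
    """Drop NS records owned by '.' unless the sender is a configured root server.

    records     -- list of (owner, rtype, rdata) tuples (illustrative layout)
    source_addr -- address the reply came from
    root_hints  -- set of addresses taken from the local root cache file
    """
    trusted = source_addr in root_hints
    return [
        (owner, rtype, rdata)
        for owner, rtype, rdata in records
        if trusted or not (owner == "." and rtype == "NS")
    ]


# Hypothetical root hints (documentation addresses, not real root servers).
ROOT_HINTS = {"192.0.2.1", "192.0.2.2"}

reply = [
    (".", "NS", "TELECOM."),
    ("example.edu.", "NS", "ns.example.edu."),
]

# From an unknown server the root NS record is dropped, the rest is kept.
print(filter_records(reply, "198.51.100.7", ROOT_HINTS))
# From a configured root server everything is kept.
print(filter_records(reply, "192.0.2.1", ROOT_HINTS))
```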

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115.                                  +1-815-753-6940

asp@uunet.UU.NET (Andrew Partan) (11/14/90)

In article <9163@ncar.ucar.edu>, woods@ncar.ucar.edu (Greg Woods) writes:
> I find the following bogus root servers in both the primary and
> secondary servers' caches:
> 
> (root)  nameserver = MTECV1
> (root)  nameserver = TELECOM
> (root)  nameserver = NEXTSVR

I was poking around in a dump of our named's cache and found the
offending records.  I also found A records for them:
	TELECOM. 230636  IN      A       132.254.1.11
	NEXTSVR. 242153  IN      A       132.254.1.6
	MTECV1.  242152  IN      A       131.178.1.1

Now the rest of the hosts in 132.254 are in *.MTY.ITESM.MX.
In fact, there is a MTECV1.MTY.ITESM.MX. with the same A record as the
bogus MTECV1.:
$ORIGIN MTY.ITESM.MX.
mtecv2  49142   IN      A       131.178.1.5
	49142   IN      HINFO   "VAX-6310" "Ultrix"
TECMTYVM        76031   IN      A       131.178.1.7
	76031   IN      HINFO   "IBM-4381" "VM_4.0"
MTECV1  421405  IN      A       131.178.1.1     ; 789
	54794   IN      A       129.117.4.2     ; 961

My guess is that someone at MTY.ITESM.MX. was setting up a zone and
added an extra trailing . where he/she shouldn't have.
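
To illustrate the guess: in a master file a name ending in "." is taken as
absolute, while any other name has the current $ORIGIN appended. Here is a
rough Python sketch of just that rule. The function name qualify is made up
for the example, and this is not BIND's parser.

```python
def qualify(name, origin):
    """Return the absolute owner name a master-file entry would get.

    A name ending in "." is already absolute and is left alone;
    anything else has the current $ORIGIN appended.
    """
    if name.endswith("."):
        return name
    if origin == ".":
        return name + "."
    return name + "." + origin


origin = "MTY.ITESM.MX."
print(qualify("MTECV1", origin))   # MTECV1.MTY.ITESM.MX.  (what was meant)
print(qualify("MTECV1.", origin))  # MTECV1.               (stray trailing dot)
```

With the stray dot the name lands directly under the root, which matches the
bogus MTECV1. record in the dump.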

The nameservers for ITESM.MX. are:
	ITESM.MX.       86400   NS      mtecv1.mty.itesm.mx.
	ITESM.MX.       86400   NS      emx.utexas.edu.

And from the SOA, the responsible person is root@telecom.rzs.itesm.mx.

	--asp@uunet.uu.net (Andrew Partan)

del@thrush.mlb.semi.harris.com (Don Lewis) (11/15/90)

We picked up the Mexican triplets last week.  First ADM.BRL.MIL (listed
as a name server for 9.9.192.in-addr.arpa) referred us back to the root
servers on a query for 1.9.9.192.in-addr.arpa.  In the referral message,
it listed LBL.GOV as one of the root servers.  Our name server cached
this information.  Shortly thereafter, we queried LBL.GOV (because we
now thought it was a root server) about ncstate.edu, and it responded
with a delegation back to the root servers, and it listed TELECOM, MTECV1,
and NEXTSVR in this list.  Apparently we got A records as well,
since we then started sending queries to 131.178.1.1 (mtecv1.mty.itesm.mx).

Relevant log entries follow:

Nov  9 13:00:01 slopoke named[20874]: Root NS LBL.GOV received from 192.5.25.4 on query on name [1.9.9.192.in-addr.arpa]

Nov  9 13:56:29 slopoke named[20874]: Root NS TELECOM received from 128.3.254.23 on query on name [ncstate.edu]
Nov  9 13:56:29 slopoke named[20874]: Root NS NEXTSVR received from 128.3.254.23 on query on name [ncstate.edu]
Nov  9 13:56:29 slopoke named[20874]: Root NS MTECV1 received from 128.3.254.23 on query on name [ncstate.edu]
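
For anyone who wants to hunt through their own syslog, here is a small Python
sketch that groups these messages by the server that sent them. The pattern is
derived only from the lines above, so treat the exact message format as an
assumption about your named; the function name is made up.

```python
import re
from collections import defaultdict

# Pattern based solely on the log lines quoted in this post.
ROOT_NS = re.compile(
    r"Root NS (?P<ns>\S+) received from (?P<src>\d+\.\d+\.\d+\.\d+)"
    r" on query on name \[(?P<qname>[^\]]*)\]"
)


def root_ns_by_source(lines):
    """Map each source address to the set of root NS names it handed out."""
    seen = defaultdict(set)
    for line in lines:
        m = ROOT_NS.search(line)
        if m:
            seen[m.group("src")].add(m.group("ns"))
    return dict(seen)


sample = [
    "Nov  9 13:00:01 slopoke named[20874]: Root NS LBL.GOV received from "
    "192.5.25.4 on query on name [1.9.9.192.in-addr.arpa]",
    "Nov  9 13:56:29 slopoke named[20874]: Root NS TELECOM received from "
    "128.3.254.23 on query on name [ncstate.edu]",
    "Nov  9 13:56:29 slopoke named[20874]: Root NS NEXTSVR received from "
    "128.3.254.23 on query on name [ncstate.edu]",
    "Nov  9 13:56:29 slopoke named[20874]: Root NS MTECV1 received from "
    "128.3.254.23 on query on name [ncstate.edu]",
]

for src, names in sorted(root_ns_by_source(sample).items()):
    print(src, sorted(names))
```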
-- 
Don "Truck" Lewis                      Harris Semiconductor
Internet:  del@mlb.semi.harris.com     PO Box 883   MS 62A-028
Phone:     (407) 729-5205              Melbourne, FL  32901

kre@cs.mu.oz.au (Robert Elz) (11/19/90)

In article <9163@ncar.ucar.edu>, woods@ncar.ucar.edu (Greg Woods) writes:
> What happens is that when a machine is rebooted (or named is restarted), it
> goes into an infinite loop burning tons of CPU time and refusing to answer
> queries.

While several people have been hunting for the source of the trash
in the DNS, I haven't seen an answer explaining why the loop ...

I believe that what is happening is that BIND is using a UDP request
to (one or more of) the servers in your root.cache, asking for a
list of the root servers.   It is expecting that the reply it gets
will contain a list of NS records for '.', and at least an address
for one of them.

What's happening with all of the trash NS's included is that there is
no space left in the UDP reply packet for any "additional info" records,
and it's those that contain the A records corresponding to the NS's.
Hence, you end up with a list of root NS's, but no idea how to
actually reach any of them.
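
A back-of-the-envelope Python sketch of the space argument, assuming the
512-byte UDP message limit and the record layout from RFC 1035. It ignores
name compression, so it overestimates real packet sizes, and the server names
are hypothetical. It only illustrates the effect; it is not a trace of what
BIND or any root server actually sent.

```python
UDP_LIMIT = 512   # maximum DNS message size over UDP (RFC 1035)
HEADER = 12       # fixed DNS header
RR_FIXED = 10     # type(2) + class(2) + ttl(4) + rdlength(2)


def name_len(name):
    """Wire length of a domain name, without compression."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1  # +1 for root label


def ns_rr_len(owner, target):
    """Size of one NS record: owner name, fixed fields, target name."""
    return name_len(owner) + RR_FIXED + name_len(target)


def a_rr_len(owner):
    """Size of one A record: owner name, fixed fields, 4-byte address."""
    return name_len(owner) + RR_FIXED + 4


def glue_that_fits(ns_names, limit=UDP_LIMIT):
    """Return the NS names whose A records still fit after all root NS records."""
    question = name_len(".") + 4  # qname + qtype + qclass
    used = HEADER + question + sum(ns_rr_len(".", n) for n in ns_names)
    fits = []
    for n in ns_names:
        size = a_rr_len(n)
        if used + size > limit:
            break
        used += size
        fits.append(n)
    return fits


# Hypothetical server names, used only to show the effect of a longer list.
short_list = ["ns%02d.example.org." % i for i in range(7)]
long_list = ["ns%02d.example.org." % i for i in range(17)]

print(len(glue_that_fits(short_list)), "of", len(short_list), "addresses fit")
print(len(glue_that_fits(long_list)), "of", len(long_list), "addresses fit")
```

Under these assumptions the short list leaves room for every address, while
the long list fills the packet with NS records and leaves room for none.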

At this point BIND goes nuts ... one could hope that it would just
send additional queries to the server that replied with the list
of NS's, explicitly asking for A's to match, or perhaps try a TCP
connection to that server and ask for the root NS's again, something...

But as has been said before, BIND has many, many problems.

kre