woods@ncar.ucar.edu (Greg Woods) (11/14/90)
(This first started last Wednesday and has continued through this morning)

We are having a serious problem with our name servers that APPEARS to be related to bogus root server data that is coming in from God knows where. Our configuration is that we have a primary server (ncar.ucar.edu a.k.a. handies.ucar.edu) which is the known server for our domain and is queried from the outside. We also have a server unknown to the outside which is configured as a secondary and is used as the forwarder by most of our internal machines (some internal machines still forward to the primary due to inertia; I don't control every server here so some of them are slow to change over as I have asked). (In case it matters, both are BIND 4.8.2, the primary is a Sun 4/280 running Sun OS 4.0.3, and the secondary is a Microvax II (a.k.a. boat anchor :-) running Ultrix 3.1)

What happens is that when a machine is rebooted (or named is restarted), it goes into an infinite loop burning tons of CPU time and refusing to answer queries. It also ignores all signals (except 9, of course) which makes debugging a real pain. Empirical evidence shows that every time this has happened, I find the following bogus root servers in both the primary and secondary servers' caches:

    (root)  nameserver = MTECV1
    (root)  nameserver = TELECOM
    (root)  nameserver = NEXTSVR

These appear with no domains and with no corresponding A record which I suspect may be the root of the problem (pun not intended, I swear). If this junk is NOT in the cache, then name servers using one of these as the forwarder can be started fine. If this junk *is* present, then killing and restarting first the primary and then the secondary (which of course removes the junk) will allow other servers here to be restarted. Occasionally I also see "lbl.gov" show up as a root server, but if it is there without these other three, it does not seem to cause the problem to occur. It occurs to me that the probable reason for that is that lbl.gov is a legitimate name that can be looked up and an A record eventually found, even if it isn't really a root server.

Has anyone else seen this? Does anyone have any idea what the &^$%#@! is going on? I am familiar with how the DNS works on an administrative and conceptual level, but I am not familiar with BIND on a source code level, nor does the rather cryptic output you get when you turn debugging on make a whole lot of sense to me (the latter is a consequence of the former, I expect).

Before I dive into the source code, I'd like to ask: is there any reason why data about the root domain coming in from outside should EVER be believed and cached? Has anyone patched BIND to disallow this? Will I break the entire DNS if I do this here? :-)

--Greg
rickert@mp.cs.niu.edu (Neil Rickert) (11/14/90)
In article <9163@ncar.ucar.edu> woods@ncar.UCAR.EDU (Greg Woods) writes:
>We are having a serious problem with our name servers that APPEARS to be
>related to bogus root server data that is coming in from God knows where.
>Our configuration is that we have a primary server (ncar.ucar.edu a.k.a.
>(....)
>
>What happens is that when a machine is rebooted (or named is restarted), it
>goes into an infinite loop burning tons of CPU time and refusing to answer
>(....)
>queries. It also ignores all signals (except 9, of course) which makes
>debugging a real pain. Empirical evidence shows that every time this has
>happened, I find the following bogus root servers in both the primary and
>secondary servers' caches:
>
>Has anyone else seen this? Does anyone have any idea what the &^$%#@! is going

Probably lots of people have seen the bogus records. They didn't cause any loops on our system, but we did have to kill and restart named to remove them from the cache. (Come to think of it, maybe it is about time I rechecked the cache to see if they have reappeared).

>Before I dive into the source code, I'd like to ask: is there any reason why
>data about the root domain coming in from outside should EVER be believed
>and cached? Has anyone patched BIND to disallow this? Will I break the entire
>DNS if I do this here? :-)
>

If the data comes from one of the root servers, it should be believed and cached. (Alas, at one stage a root server was putting out these bogus records - it had been contaminated too).

--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Neil W. Rickert, Computer Science <rickert@cs.niu.edu>
Northern Illinois Univ.
DeKalb, IL 60115. +1-815-753-6940
asp@uunet.UU.NET (Andrew Partan) (11/14/90)
In article <9163@ncar.ucar.edu>, woods@ncar.ucar.edu (Greg Woods) writes:
> I find the following bogus root servers in both the primary and
> secondary servers' caches:
>
> (root) nameserver = MTECV1
> (root) nameserver = TELECOM
> (root) nameserver = NEXTSVR

I was poking around in a dump of our named's cache and found the offending records. I also found A records for them:

    TELECOM.  230636  IN  A  132.254.1.11
    NEXTSVR.  242153  IN  A  132.254.1.6
    MTECV1.   242152  IN  A  131.178.1.1

Now the rest of the hosts in 132.254 are in *.MTY.ITESM.MX. In fact, there is a MTECV1.MTY.ITESM.MX. with the same A record as the bogus MTECV1.:

    $ORIGIN MTY.ITESM.MX.
    mtecv2    49142   IN  A      131.178.1.5
              49142   IN  HINFO  "VAX-6310" "Ultrix"
    TECMTYVM  76031   IN  A      131.178.1.7
              76031   IN  HINFO  "IBM-4381" "VM_4.0"
    MTECV1    421405  IN  A      131.178.1.1 ; 789
              54794   IN  A      129.117.4.2 ; 961

My guess is that someone at MTY.ITESM.MX. was setting up a zone and added an extra trailing . where he/she shouldn't have.

The nameservers for ITESM.MX. are:

    ITESM.MX.  86400  NS  mtecv1.mty.itesm.mx.
    ITESM.MX.  86400  NS  emx.utexas.edu.

And from the SOA, the responsible person is root@telecom.rzs.itesm.mx.

--asp@uunet.uu.net (Andrew Partan)
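The trailing-dot guess above can be illustrated with a minimal Python sketch of how a zone (master) file expands names: a name that ends in a dot is taken as absolute, and anything else has the current origin appended. The function name `qualify` is hypothetical and is not part of BIND; the sketch only models the naming rule, not a real zone-file parser.

```python
def qualify(name: str, origin: str) -> str:
    """Expand a zone-file name against an origin (illustrative helper only)."""
    if not origin.endswith("."):
        raise ValueError("origin must be absolute (end with a dot)")
    if name == "@":
        # "@" stands for the current origin itself.
        return origin
    if name.endswith("."):
        # Trailing dot: the name is already absolute, the origin is NOT appended.
        return name
    if origin == ".":
        return name + "."
    # Relative name: the origin is appended.
    return name + "." + origin


if __name__ == "__main__":
    origin = "MTY.ITESM.MX."
    for written in ("MTECV1", "MTECV1."):
        print(f"{written!r:10} -> {qualify(written, origin)}")
```

Under this rule, writing `MTECV1.` instead of `MTECV1` in a zone rooted at `MTY.ITESM.MX.` yields a one-label name directly under the root, which matches the shape of the bogus records in the cache dump.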
del@thrush.mlb.semi.harris.com (Don Lewis) (11/15/90)
We picked up the Mexican triplets last week. First ADM.BRL.MIL (listed as a name server for 9.9.192.in-addr.arpa) referred us back to the root servers on a query for 1.9.9.192.in-addr.arpa. In the referral message, it listed LBL.GOV as one of the root servers. Our name server cached this information. Shortly thereafter, we queried LBL.GOV (because we now thought it was a root server) about ncstate.edu, and it responded with a delegation back to the root servers, and it listed TELECOM, MTECV1, and NEXTSVR in this list. Apparently we got A records as well, since we then started sending queries to 131.178.1.1 (mtecv1.mty.itesm.mx).

Relevant log entries follow:

    Nov 9 13:00:01 slopoke named[20874]: Root NS LBL.GOV received from 192.5.25.4 on query on name [1.9.9.192.in-addr.arpa]
    Nov 9 13:56:29 slopoke named[20874]: Root NS TELECOM received from 128.3.254.23 on query on name [ncstate.edu]
    Nov 9 13:56:29 slopoke named[20874]: Root NS NEXTSVR received from 128.3.254.23 on query on name [ncstate.edu]
    Nov 9 13:56:29 slopoke named[20874]: Root NS MTECV1 received from 128.3.254.23 on query on name [ncstate.edu]

--
Don "Truck" Lewis                      Harris Semiconductor
Internet:  del@mlb.semi.harris.com     PO Box 883   MS 62A-028
Phone:     (407) 729-5205              Melbourne, FL  32901
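For anyone wanting to trace the same contamination chain in their own syslog, here is a rough Python sketch that groups "Root NS ... received from ..." messages by the address that sent them. The regular expression is inferred only from the four log lines quoted in this post, so treat the pattern as an assumption; other named versions may word the message differently. The function name `root_ns_by_source` is made up for illustration.

```python
import re
from collections import defaultdict

# Sample input: the four log lines quoted in the post above.
LOG = """\
Nov 9 13:00:01 slopoke named[20874]: Root NS LBL.GOV received from 192.5.25.4 on query on name [1.9.9.192.in-addr.arpa]
Nov 9 13:56:29 slopoke named[20874]: Root NS TELECOM received from 128.3.254.23 on query on name [ncstate.edu]
Nov 9 13:56:29 slopoke named[20874]: Root NS NEXTSVR received from 128.3.254.23 on query on name [ncstate.edu]
Nov 9 13:56:29 slopoke named[20874]: Root NS MTECV1 received from 128.3.254.23 on query on name [ncstate.edu]
"""

# Pattern assumed from the message wording shown above.
PATTERN = re.compile(
    r"Root NS (\S+) received from (\S+) on query on name \[([^\]]*)\]"
)


def root_ns_by_source(text):
    """Map each sending address to the (root NS name, query name) pairs it supplied."""
    by_source = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            ns_name, source, query_name = match.groups()
            by_source[source].append((ns_name, query_name))
    return dict(by_source)


if __name__ == "__main__":
    for source, entries in root_ns_by_source(LOG).items():
        names = ", ".join(ns for ns, _ in entries)
        print(f"{source}: {names}")
```

Grouping by source address makes the two hops visible: one server hands out LBL.GOV as a root server, and a second address then hands out the three one-label names.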
kre@cs.mu.oz.au (Robert Elz) (11/19/90)
In article <9163@ncar.ucar.edu>, woods@ncar.ucar.edu (Greg Woods) writes:
> What happens is that when a machine is rebooted (or named is restarted), it
> goes into an infinite loop burning tons of CPU time and refusing to answer
> queries.

While several people have been hunting for the source of the trash in the DNS, I haven't seen an answer explaining why the loop ...

I believe that what is happening is that BIND is using a UDP request to (one or more of) the servers in your root.cache, asking for a list of the root servers. It is expecting that the reply it gets will contain a list of NS records for '.', and at least an address for one of them.

What's happening with all of the trash NS's included is that there is no space left in the UDP reply packet for any "additional info" records, and it's those that contain the A records corresponding to the NS's. Hence, you end up with a list of root NS's, but no idea how to actually reach any of them.

At this point BIND goes nuts ... one could hope that it would just send additional queries to the server that replied with the list of NS's, explicitly asking for A's to match, or perhaps try a TCP connection to that server and ask for the root NS's again, something...

But as has been said before, BIND has many, many, problems.

kre
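The space argument above can be made concrete with a back-of-the-envelope Python sketch. It assumes the classic 512-byte limit on DNS messages over UDP, packs every NS record for '.' first, and then counts how many A records still fit in the additional section. It deliberately ignores name compression, which real servers use, so the absolute numbers are pessimistic and only the trend is meaningful. The server names are invented placeholders, and `glue_that_fits` is a hypothetical helper, not anything in BIND.

```python
HEADER = 12       # fixed DNS header size in bytes
UDP_LIMIT = 512   # classic maximum DNS message size over UDP
RR_FIXED = 10     # per-record overhead: type + class + TTL + rdlength


def wire_len(name: str) -> int:
    """Uncompressed wire length of a domain name (length byte per label + root byte)."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(len(label) + 1 for label in labels) + 1


def glue_that_fits(servers):
    """Count the A records that fit after all root NS records are packed.

    If the NS records alone overflow the limit, a real server would truncate
    the reply; this sketch simply reports 0 in that case.
    """
    used = HEADER + wire_len(".") + 4  # header + question (name, type, class)
    for server in servers:
        used += wire_len(".") + RR_FIXED + wire_len(server)  # NS record owned by '.'
    fitted = 0
    for server in servers:
        need = wire_len(server) + RR_FIXED + 4  # A record: owner + overhead + IPv4
        if used + need > UDP_LIMIT:
            break
        used += need
        fitted += 1
    return fitted


if __name__ == "__main__":
    # Placeholder names, not the real root servers.
    for count in (5, 10, 17):
        servers = [f"ns{i}.example.net." for i in range(count)]
        print(count, "NS records ->", glue_that_fits(servers), "A records fit")
```

Every extra NS record in the answer eats into the room left for addresses, so a reply padded with junk NS names can plausibly arrive with a full list of names and no way to reach any of them.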