[comp.protocols.tcp-ip] named going into an infinite loop ...

mf@ircam.fr (Michel Fingerhut) (10/22/90)

Machine: DECsystem 5820 (RISC)
OS:      Ultrix 4.0 (Rev. 179)

Every once in a while (every 3-4 days), the name daemon starts eating CPU
time, goes to the top of the queue, and fills the syslog error message
table with messages of the form

	Oct 22 10:23:37 localhost: 93 named: accept: Too many open files

(one a second, approximately) until it is killed and/or chokes /usr/spool.
Upon restart, it works fine.  There is no apparent flood of requests prior
to that.

Does anyone have a suggestion on how to approach the problem?

Thanks,
Michael Fingerhut

tinguely@plains.NoDak.edu (Mark Tinguely) (10/24/90)

In article <1990Oct22.105209.28006@ircam.ircam.fr> mf@ircam.fr (Michel Fingerhut) writes:
>Machine: DECsystem 5820 (RISC)
>OS:      Ultrix 4.0 (Rev. 179)
>Every once in a while (every 3-4 days), the name daemon starts eating CPU
>time, goes to the top of the queue, and fills the syslog error message
>table with messages of the form
>	Oct 22 10:23:37 localhost: 93 named: accept: Too many open files


 Do you have machines that query the name server by TCP rather than
 UDP? You can check this with `netstat' (look for TCP connections to
 port 53, the domain service). We had the same problem with an IBM 3090
 querying our BIND 4.8.1 (and earlier releases) nameserver.
 I am sure the Ultrix server is based upon BIND 4.8.
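
As a rough illustration of the netstat check, the sketch below scans the text of numeric netstat output for TCP connections whose local port is 53. It assumes `netstat -an` style output with the usual Proto / Recv-Q / Send-Q / Local Address / Foreign Address / state columns; the function name and the regular expression are my own, and column layout varies between systems, so treat it as a starting point only.

```python
import re

# One TCP line of `netstat -an`: proto, recv-q, send-q, local, foreign, state.
# BSD-style netstat prints "192.0.2.1.53"; other systems print "192.0.2.1:53".
TCP_LINE = re.compile(r"^tcp\S*\s+\d+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)")


def tcp_dns_peers(netstat_text, port=53):
    """Return (foreign_address, state) for each TCP connection to local `port`.

    Hypothetical helper: it only inspects text, it does not run netstat.
    """
    peers = []
    for line in netstat_text.splitlines():
        match = TCP_LINE.match(line.strip())
        if not match:
            continue  # header lines, UDP lines, anything unrecognised
        local, foreign, state = match.groups()
        # The port is whatever follows the last '.' or ':' in the address.
        local_port = re.split(r"[.:]", local)[-1]
        if local_port == str(port) and state != "LISTEN":
            peers.append((foreign, state))
    return peers
```

Feeding it the captured output of netstat on the name server gives the list of hosts that are talking to named over TCP; a steadily growing list would point at the problem described here.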

 About 7 months ago I posted the fix to this problem, and (though I did
 not check) I think a similar fix went into BIND 4.8.2. There are two
 problems, but both stem from the fact that TCP queries are queued.
 With the original BIND code, it is possible that these queries are not
 properly released as they sit waiting on a time queue. UDP resolutions
 are just discarded if they cannot be resolved right away, and do not
 cause this problem.
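
To make the failure mode concrete: every queued TCP query keeps a file descriptor open, so entries that are never released eventually push the process to its descriptor limit, at which point accept() fails with "Too many open files". The sketch below is not the BIND code and not the posted fix; it is a hypothetical queue (class and method names are mine) showing the general idea of closing entries that have waited too long.

```python
import socket
import time


class PendingTcpQueries:
    """Hypothetical queue of accepted TCP connections awaiting an answer."""

    def __init__(self, max_age):
        self.max_age = max_age  # seconds a query may wait before release
        self.entries = []       # list of (arrival_time, socket)

    def add(self, conn, now=None):
        """Queue a connection, stamping it with its arrival time."""
        arrived = time.monotonic() if now is None else now
        self.entries.append((arrived, conn))

    def reap(self, now=None):
        """Close and drop every entry older than max_age; return the count."""
        now = time.monotonic() if now is None else now
        keep, closed = [], 0
        for arrived, conn in self.entries:
            if now - arrived > self.max_age:
                conn.close()  # this is what gives the descriptor back
                closed += 1
            else:
                keep.append((arrived, conn))
        self.entries = keep
        return closed
```

A server loop would call reap() on each pass, before accept(), so that stale connections cannot accumulate. The explicit `now` argument exists only so the behaviour can be checked without waiting on a real clock.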

 If you do not want to update your nameserver to BIND (boy did I find out
 this week how many people think I am a radical for running public-domain
 software [that works correctly]), then ask DEC to update the server.

 Last week I removed my "diff" files for the BIND error (assuming these
 were picked up in BIND 4.8.3 located at ucbarpa.berkeley.edu in the
 4.3 directory). I just quickly scanned the areas that I modified in
 the BIND 4.8.3 files and did not see the removal of queued TCP entries,
 but since I don't follow the BIND mailing list, they may have implemented
 the solution in a different fashion than I did (or did not pick up the
 changes at all). If there is a need for the TCP BIND fixes, I can restore
 them to our anonymous ftp partition.
-- 
Mark Tinguely           North Dakota State University,  Fargo, ND  58105
  UUCP:       		...!uunet!plains!tinguely
  BITNET:      		tinguely@plains.bitnet
  INTERNET:   		tinguely@plains.NoDak.edu

HAROLD@UGA.CC.UGA.EDU (Harold Pritchett) (10/25/90)

On Mon, 22 Oct 90 10:52:09 GMT Michel Fingerhut said:
>Machine: DECsystem 5820 (RISC)
>OS:      Ultrix 4.0 (Rev. 179)
>
>Every once in a while (every 3-4 days), the name daemon starts eating CPU
>time, goes to the top of the queue, and fills the syslog error message
>table with messages of the form
>
>	Oct 22 10:23:37 localhost: 93 named: accept: Too many open files
>
>(one a second, approximately) until it is killed and/or chokes /usr/spool.
>Upon restart, it works fine.  There is no apparent flood of requests prior
>to that.

Boy, do I have news for you.  We had that same problem here for approx a
month!!  DEC looked at it, we sent them dumps, they remotely logged onto
our machine, and finally they told us what was wrong!  The "/etc/resolv.conf"
file was mis-configured.

Make SURE that the first nameserver entry in the file points to the loopback
address.  It should look something like this:

domain     your.domain.edu
nameserver 127.0.0.1
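
If you want to verify this on several machines, a small check along the following lines may help. It is only a sketch: the function names are made up for illustration, and it assumes the simple "keyword value" layout shown above, reading the file's text rather than opening /etc/resolv.conf itself.

```python
def first_nameserver(resolv_conf_text):
    """Return the address on the first `nameserver` line, or None if absent."""
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            return fields[1]
    return None


def points_at_loopback(resolv_conf_text):
    """True when the first nameserver entry is the loopback address."""
    return first_nameserver(resolv_conf_text) == "127.0.0.1"
```

Passing in the contents of /etc/resolv.conf and getting False back means some other server is listed ahead of the loopback address (or no nameserver is listed at all).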

We fixed ours and have not had the problem since, and that has been over two
weeks.  We also found that before we fixed the file, named would not dump
its cache or stats in response to a kill -INT or kill -IOT command, and the
change seems to have fixed that also.

For more information, you may want to contact Therese Grise in the DEC
Nashua, NH office, or Larry Pruitt in Atlanta, GA at (404) 772 2665.

Harold C Pritchett         |  BITNET:  HAROLD@UGA
BITNET TechRep             |    ARPA:  harold@uga.cc.uga.edu
The University of Georgia  |
Athens, GA 30602           |    fido:  1:370/60
(404) 542-3135             |     Bbs:  SYSOP at (404) 354-0817