[comp.protocols.tcp-ip.domains] bug in BSD tahoe res_send

jeffe@sandino.austin.ibm.com (Peter Jeffe 512.823.4091) (08/02/90)

I'm not sure if this is a bug, but in BSD tahoe res_send.c (6.21), the
following code looks fishy:

|	for (retry = _res.retry; retry > 0; retry--) {
|	   for (ns = 0; ns < _res.nscount; ns++) {
|... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
|#if	BSD >= 43
|			if (_res.nscount == 1 || retry == _res.retry) {
                                              !!!^^^^^^^^^^^^^^^^!!!
|				/*
|				 * Don't use connect if we might
|				 * still receive a response
|				 * from another server.
|				 */
|				if (connected == 0) {
|					if (connect(s, &_res.nsaddr_list[ns],
|					    sizeof(struct sockaddr)) < 0) {
|	[ #ifdef DEBUG ]
|						continue;
|					}
|					connected = 1;
|				}
|				if (send(s, buf, buflen, 0) != buflen) {
|	[ #ifdef DEBUG ]
|					continue;
|				}
|			} else {
|	[ connect() to no_addr (0.0.0.0) and sendto(_res.nsaddr_list[ns]) ]

The problem (?) is that while (retry == _res.retry) we only query the first
host, since we connect() the first time through, and then keep send()ing to
whomever we're connected to, until we decrement retry and fall through to
the sendto() call.  So if we've got 3 nameservers defined in resolv.conf,
and all of them are down, we query in the following order:

	1, 1, 1,  1, 2, 3,  1, 2, 3

Now, I don't know if this was intended, but it jes' don't seem right to
me.  Especially since the virtual-circuit code doesn't act the same.
Personally, I'd chuck the connect()/send() stuff and just do the sendto()
(or at least do it only on (_res.nscount == 1) if you really think it's
more efficient).  What say ye all?

-------------------------------------------------------------------------------
Peter Jeffe   ...uunet!cs.utexas.edu!ibmchs!auschs!sandino.austin.ibm.com!jeffe
        first they want a disclaimer, then they make you pee in a jar,
                   then they come for you in the night

skl@van-bc.wimsey.bc.ca (Samuel Lam) (08/03/90)

In article <3010@awdprime.UUCP>, jeffe@sandino.austin.ibm.com
 (Peter Jeffe 512.823.4091) wrote:
>I'm not sure if this is a bug, but in BSD tahoe res_send.c (6.21), the
>following code looks fishy:
...
>|			if (_res.nscount == 1 || retry == _res.retry) {
>                                             !!!^^^^^^^^^^^^^^^^!!!

(What a coincident!)  I also nailed this one within the last month,
as it had caused us grief around here for a long time and I finally
gotten around to it.  I think this is definitely a bug, but my
analysis is different from yours.

I fixed it by changing the problem line from

    if (_res.nscount == 1 || retry == _res.retry) {
                          **
to

    if (_res.nscount == 1 && retry == _res.retry) {
                          **

as this fix made sense to me at the time.  i.e. The condition become
"if we only have one name server, *then* if this is also our first
attempt to use it, then <do ...>".

I have talked to the Berkeley person in charge of 4.4 about this (Mike)
at the Vancouver IETF this week, and he commented that the 4.3-tahoe
BIND is a couple years old and suggested that I take a look at the
4.4-reno BIND (to be available shortly) to see if the bug is still
there before I post a note to the BIND mailing list about it.

...Sam
-- 
Internet: <skl@wimsey.bc.ca>	UUCP: {van-bc,ubc-cs,uunet}!wimsey.bc.ca!skl

earle@poseur.JPL.NASA.GOV (Greg Earle - Sun JPL on-site Software Support) (08/04/90)

In article <1199@van-bc.wimsey.bc.ca>, skl@van-bc.wimsey.bc.ca (Samuel
Lam) writes:
>>I'm not sure if this is a bug, but in BSD tahoe res_send.c (6.21), the
>>following code looks fishy:
>...
>>|			if (_res.nscount == 1 || retry == _res.retry) {
>>                                             !!!^^^^^^^^^^^^^^^^!!!
> 
>(What a coincidence!)  I also nailed this one within the last month,
>as it had caused us grief around here for a long time and I finally
>gotten around to it.  I think this is definitely a bug, but my
>analysis is different from yours.
> 
>I fixed it by changing the problem line from
> 
>    if (_res.nscount == 1 || retry == _res.retry) {
>                          **
>to
>    if (_res.nscount == 1 && retry == _res.retry) {
>                          **
>
>as this fix made sense to me at the time.  i.e. The condition become
>"if we only have one name server, *then* if this is also our first
>attempt to use it, then <do ...>".

Yes, this is a bug.  But I do not think that your fix is the correct one;
for a better fix, use the res_send.c that comes with any BIND 4.8.2 version
that you can find (I believe several versions of BIND that purport to being
`BIND 4.8.2.x' abound; perhaps someone could make a definitive statement on
the situation?).  I chose the one from University of Toronto; you can get it
via anonymous FTP from neat.AI.Toronto.EDU, in `utbind.tar.Z'.  Examine the
code and you will see where this problem is dealt with directly.  The
connect()-ed socket is dis-connect()-ed if the resolver gets ECONNREFUSED
from the first round of nameserver queries.

In any case, this topic is fodder for the BIND mailing list, not the
comp.protocols.tcp-ip.domains newsgroup.  Use FTP to ucbarpa.Berkeley.EDU to
retrieve the archives of the BIND mailing list to find the original discussion
of this problem, and the solution proposed by the person whose fix is in the
U. of Toronto version.

--
	- Greg Earle
	  Sun Microsystems, Inc. - JPL on-site Software Support
	  earle@poseur.JPL.NASA.GOV	(Direct)
	  earle@Sun.COM			(Indirect)