[comp.unix.questions] 4.2 vs. 4.3 sockets

joshua@athertn.Atherton.COM (Flame Bait) (12/04/89)

I've got a programming problem which I hope someone already has the
answer for.  It has already caused me much grief. (None of it good :-)

I have a client/server program which works just fine on BSD 4.2 type
systems (like SunOS 3.5), but it fails on BSD 4.3 type systems (like
SunOS 4.0.3 and AIX 2.2.1).  I already changed the select system call
so that it uses FD_SET, FD_ISSET, fd_set, and friends.  

Are there any other 4.2/4.3 differences?  The changes I made were very 
small.  I used FD_SET instead of a bit set before the select call, FD_SETSIZE
instead of sizeof(int) in the select call, and FD_ISSET instead of a
bit test after the call.  Is there anything else I need to change?
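
For reference, the select usage described above might look roughly like the sketch below. This is a minimal illustration, not the actual program: the name wait_readable and the timeout handling are assumptions. It passes fd + 1 as the first argument instead of FD_SETSIZE (either should work), and it rebuilds the fd_set on every call because select overwrites it.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_sec seconds for fd to become
 * readable.  Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    /* FD_SET on a descriptor outside [0, FD_SETSIZE) is undefined. */
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    /* select modifies the set, so it must be rebuilt before each call. */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* First argument is the highest descriptor of interest plus one. */
    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

If the set is filled in once and then reused across iterations of a loop, the bits cleared by an earlier select call stay cleared, which is one thing worth ruling out.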

Some other facts: this error happens after 3962+/-10 identical operations,
and is very consistent.  If I start a client and run 3000 operations,
kill it and run a second client for 3000 operations, then all is well.
The client application looks like this:

listen for a TCP connection
repeat 4000 times:
    send a UDP packet
    recv a response via TCP
    close the accepted TCP connection
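
In C, the outline above might be sketched as follows. The helper names make_listener and one_operation are hypothetical, the loopback address and ephemeral port are assumptions made only to keep the sketch self-contained, and a single read stands in for whatever framing the real response uses.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket listening on the loopback
 * address at a kernel-chosen port.  Stores the port (host byte order)
 * in *port.  Returns the socket, or -1 on error. */
int make_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, 5) < 0 ||
        getsockname(s, (struct sockaddr *)&addr, &len) < 0) {
        close(s);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return s;
}

/* Hypothetical helper: one iteration of the loop.  Sends req to server
 * over UDP, accepts one TCP connection on listener, does a single read
 * of up to buflen bytes, then closes the accepted connection.
 * Returns the number of bytes read, or -1 on error. */
long one_operation(int udp, const struct sockaddr_in *server,
                   const char *req, size_t reqlen,
                   int listener, char *buf, size_t buflen)
{
    int conn;
    long n;

    if (sendto(udp, req, reqlen, 0,
               (const struct sockaddr *)server, sizeof(*server)) < 0)
        return -1;

    conn = accept(listener, NULL, NULL);
    if (conn < 0)
        return -1;

    n = (long)read(conn, buf, buflen);  /* may return a partial response */
    close(conn);                        /* closed on every path after accept */
    return n;
}
```

The real client would call something like one_operation 4000 times against the same listening socket; the point of the sketch is only that each accepted descriptor is closed exactly once per iteration.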

The reason for the UDP/TCP switch is that the server responds using UDP
if the response fits in one UDP packet; if not, it uses TCP.  To tickle the
bug I need to make a huge number of UDP request/TCP responses.  The server
is writing to the client, and the client is in the select call waiting for 
the server, but they never make contact.  They had made contact for the
3900-odd calls before this, and they make contact on a BSD 4.2 machine.  Weird.

This bug seems far too consistent for a timing problem, and the client runs
too many times for it to be running out of some resource like file descriptors.

Things that I have tried and have failed:
    I replaced FD_SETSIZE with getdtablesize().
    I got paranoid about writes only writing some of their data (I put in code
        to check the return value, and loop to write the rest, if needed.)
    I got paranoid about signals interrupting my read/write calls.  (AIX, where
        this first hit me, is mostly System V).
    I changed all my listen(sock,1) calls to listen(sock,5) calls.  (For when
        the client was listening for a TCP response).
    I added a shutdown(sock,2) before closing the socket which the client
        accepts from the server.
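
The write-checking loop mentioned in the list above could look something like the sketch below. The name write_all is hypothetical. The EINTR branch assumes POSIX-style semantics, where write returns -1 with errno set to EINTR only if no data was transferred before the signal arrived; whether a given System V derivative behaves that way is not something this sketch can confirm.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes of buf to fd, looping over
 * short writes.  Returns 0 on success, -1 on error (errno set by write). */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0) {
            /* Assumption: -1/EINTR means nothing was written, so the
             * same call can be retried from the current offset. */
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)n;   /* short write: advance and go around again */
    }
    return 0;
}
```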

Another general question (hopefully unrelated):
    If a write call on a System V type UNIX returns -1 with errno==EINTR, what
        should be done?  There is no way to know if part of the data did
        get written.  Or, is it always safe to restart the call from scratch?

I'm at my wits' end.  Email or call with any ideas you have.  Thanks.

Joshua Levy                          joshua@atherton.com  home:(415)968-3718
                        {decwrl|sun|hpda}!athertn!joshua  work:(408)734-9822