joshua@athertn.Atherton.COM (Flame Bait) (12/04/89)
I've got a programming problem which I hope someone already has the answer for. It has already caused me much grief (none of it good :-).

I have a client/server program that works fine on BSD 4.2-type systems (like SunOS 3.5), but fails on BSD 4.3-type systems (like SunOS 4.0.3 and AIX 2.2.1). I have already changed the select system call so that it uses FD_SET, FD_ISSET, fd_set, and friends. Are there any other 4.2/4.3 differences? The changes I made were very small: I used FD_SET instead of setting a bit by hand before the select call, FD_SETSIZE instead of sizeof(int) in the select call, and FD_ISSET instead of a bit test after the call. Is there anything else I need to change?

Some other facts: the error happens after 3962 +/- 10 identical operations, and it is very consistent. If I start a client and run 3000 operations, kill it, and then run a second client for 3000 operations, all is well.

The client application looks like this:

    listen for TCP connections
    repeat 4000 times:
        send a UDP packet
        receive a response via TCP
        close the accepted TCP connection

The reason for the UDP/TCP switch is that the server responds using UDP if the reply fits in one UDP packet; if not, it uses TCP. To tickle the bug I need to make a huge number of UDP request/TCP response exchanges.

When it fails, the server is writing to the client and the client is in the select call waiting for the server, but they never make contact. They made contact for the 3900-odd calls before this, and they make contact on a BSD 4.2 machine. Weird. This bug seems far too consistent to be a timing problem, and the client runs too many times for it to be running out of some resource like file descriptors.

Things I have tried, without success: I replaced FD_SETSIZE with getdtablesize(). I got paranoid about writes only writing some of their data, so I put in code to check the return value and loop to write the rest if needed. I got paranoid about signals interrupting my read/write calls (AIX, where this first hit me, is mostly System V).
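For reference, here is a minimal sketch of the 4.3BSD-style select() pattern described above: build an fd_set with FD_ZERO/FD_SET, call select(), then test with FD_ISSET. It is written against today's POSIX headers, so header locations may differ on SunOS 4 or AIX 2. The helper name read_with_timeout and its timeout behaviour are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (modern POSIX headers assumed): wait up to
 * timeout_sec seconds for fd to become readable, then read() from it.
 * Returns the read() result, or -1 with errno set to ETIMEDOUT if
 * nothing arrived in time.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_sec)
{
    fd_set rfds;
    struct timeval tv;
    int n;

    /* FD_SET() on a descriptor outside [0, FD_SETSIZE) is undefined. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    do {
        /* select() overwrites the set, and on some systems the timeout,
           so both are rebuilt before every call.  Restarting after EINTR
           restarts the full timeout, which is a simplification. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        /* First argument: highest descriptor in any set, plus one. */
        n = select(fd + 1, &rfds, NULL, NULL, &tv);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;                      /* select() itself failed */
    if (n == 0 || !FD_ISSET(fd, &rfds)) {
        errno = ETIMEDOUT;              /* nothing arrived in time */
        return -1;
    }
    return read(fd, buf, len);
}
```

Passing fd + 1 as the first argument means select() examines only descriptors 0 through fd. FD_SETSIZE is also a legal value, but a value larger than FD_SETSIZE is not, and getdtablesize() can exceed FD_SETSIZE on some systems.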
I also changed all my listen(sock,1) calls to listen(sock,5) (for the case where the client is listening for a TCP response), and I added a shutdown(sock,2) before closing the socket the client accepts from the server. Neither helped.

Another general question (hopefully unrelated): if a write call on a System V-type UNIX returns -1 with errno == EINTR, what should be done? There is no way to know whether part of the data was written. Or is it always safe to restart the call from scratch?

I'm at wits' end. Email or call with any ideas you have. Thanks.

Joshua Levy  joshua@atherton.com  home:(415)968-3718
{decwrl|sun|hpda}!athertn!joshua  work:(408)734-9822
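On the EINTR question: POSIX specifies that write() returns -1 with EINTR only when a signal arrives before any data has been written. If some data had already gone out, write() returns that partial count instead. Under that rule, retrying from the current position never loses or duplicates data. One loop can therefore handle both short writes and EINTR. The sketch below is minimal and assumes POSIX write() semantics, which older System V implementations may not have guaranteed. The helper name write_all is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes of buf to fd.
 * Returns 0 on success, -1 on error (errno left as set by write()).
 * Assumes POSIX write() semantics: -1 with EINTR means nothing was
 * written on that call, and a partial write returns a short count.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    /* A NULL buffer is only acceptable when there is nothing to write. */
    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* genuine error */
        }
        /* Short write: skip past the bytes already accepted and loop. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

As an alternative to retrying by hand, installing signal handlers with sigaction() and the SA_RESTART flag asks the system to restart many interrupted calls automatically.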