[net.unix-wizards] HELP: IPC bug which crashes our 4.2 systems

fouts@orville (Martin Fouts) (12/02/84)

     The attached two programs compose the shortest example I can
construct of a bug which will cause 4.2bsd to go into an infinite loop.

     The first program, talk.c, opens a socket, connects to the server
listening on that socket, sends it a message and then exits.  When used
with another program it works fine.  When used with the second program,
it causes the system to hang.

     The second program, willcrash.c, opens a socket, binds that socket
to a stream address, listens on that socket for connections, and then
does an unrelated select.

     The select should not return until there is input on channel 0,
which is standard input.
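
     For reference, a sketch of that wait in isolation. willcrash.c
passes select a bare int mask with bit 0 set, which is the 4.2bsd
interface; the version below instead assumes the fd_set macros of later
BSD and POSIX systems, and wait_readable is a hypothetical helper name,
not something from the original programs.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Block until fd has input, or until the timeout expires.
 * A NULL timeout waits forever, as the select in willcrash.c does.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, struct timeval *timeout)
{
    fd_set readfds;
    int n;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);          /* watch only this descriptor */

    /* First argument is the highest descriptor number plus one. */
    n = select(fd + 1, &readfds, NULL, NULL, timeout);
    if (n < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

     Calling wait_readable(0, NULL) corresponds to the select on
standard input described above.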

     To cause the bug to happen:

     1) Compile the two programs:
	cc -g -o willcrash willcrash.c
	cc -g -o talk talk.c

     2) Run willcrash:
        willcrash
	
     3) Run talk:
        talk test hi 1

     4) Attempt to terminate willcrash, either by typing a CTRL-C, or
by doing a kill.  At this point, the system will go into an infinite
loop.

     Any help in solving this one would be greatly appreciated.

Thanks,

Marty
fouts@ames-nas
---------------------- talk.c ------------------------------------------------
/* talk.c -- unix domain experiment */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

extern	int errno;

main(argc,argv)
char *argv[];
{
	int	sock;			/* unix socket file descriptor */
	char	ofname[20];		/* unix socket name */
	char	*request;
	char	*crayp;
	int	junk;
	int	loop;
	struct	sockaddr socketname;

	/*
	 *	Crack the parameters.
	 */
	if (argc < 4) usagerr();	/* socket, message and count are all required */
	crayp = argv[1];
	printf(" Attempting to use socket %s.\n", crayp);
	sock = socket(AF_UNIX,SOCK_STREAM,0);
	if (sock < 0) {
		perror("Talk can't open socket");
		exit(1);
	}
	socketname.sa_family = AF_UNIX;
	strcpy(socketname.sa_data,crayp);
	if (connect(sock,&socketname,sizeof(struct sockaddr)) < 0) {
	    close(sock);
	    perror("talk: Connect failed");
	    exit(1);
	}
	request = argv[2];

	loop = atoi(argv[3]);
	for (junk=0; junk < loop; junk++) {
		if (write(sock, request, strlen(request)) < 0) {
			perror("Talk can't send message");
			exit(1);
		}
	}
	close(sock);
}

/*
 *	Indicate a usage error and exit.
 */

usagerr()
{
	fprintf(stderr, "usage:  talk socket message count\n");
	exit(1);
}
-------------------- willcrash.c ----------------------------------------------
/*
 * This version will crash 4.2
 */
#include <sys/types.h>
#include <sys/socket.h>

main()

{
    int fd;
    struct sockaddr s1;
    int     ready = 1;

    fd = socket (AF_UNIX, SOCK_STREAM, 0);
    s1.sa_family = AF_UNIX;
    strcpy (s1.sa_data, "test");
    bind (fd, &s1, sizeof (struct sockaddr));
    listen (fd, 5);
    select (20, &ready, 0, 0, 0);
}


----------



jim@haring.UUCP (12/12/84)

There was indeed a bug in early versions of 4.2 which caused this to happen. The
problem arose when trying to connect to a socket whose server process exited before
accepting the connection: various parts of the uipc code assumed that another
part would tidy up partially completed connects, and looped waiting for it to
happen. Unfortunately our system has changed so much that I cannot easily make
a diff for this bug; perhaps someone else out there has it handy (it has been
discussed in unix-wizards before, about a year ago)?

Now, I know your examples are just 'shorts' designed to show the bug, but
perhaps they can be used to show a couple of things that are not clear in
the 'IPC primer' or anywhere else for that matter:

1) for the UN*X domain you should include <sys/un.h> and use 'sockaddr_un'
   instead of 'sockaddr';

2) the third argument to the 'connect' and 'bind' calls for the UN*X domain
   is the size of the string which is the name of the socket plus the size of
   the 'sun_family' element of the 'sockaddr_un' structure, e.g.
	strlen(socketname.sun_path) + sizeof(socketname.sun_family)
   where 'sun_path' is the element of the 'sockaddr_un' structure which
   contains the name (and is 108 characters in maximum size);

3) the server process needs to do an 'accept' call for the connection to
   complete. This is, in fact, why the program exhibits the bug: no accept
   is done to complete the connection. This is how I found the bug a long
   time ago.
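
The three points above can be sketched roughly as follows. This is an
illustration only, not the missing kernel fix: the function names are
made up for the example, and the address length is computed with
offsetof() rather than the expression quoted in point 2. The two agree
where 'sun_family' is the first member of 'sockaddr_un', as in 4.2, but
offsetof() also covers later systems that put a length byte in front of
it. The casts and 'socklen_t' likewise assume a later, ANSI/POSIX-style
system.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Points 1 and 2: fill in a sockaddr_un and return the length to pass
 * to bind/connect, or -1 if the name does not fit in sun_path. */
int make_unix_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return (int)(offsetof(struct sockaddr_un, sun_path) + strlen(path));
}

/* Server side: bind a stream socket to path and listen on it. */
int serve_listen(const char *path)
{
    struct sockaddr_un addr;
    int len = make_unix_addr(&addr, path);
    int fd;

    if (len < 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    unlink(path);                   /* remove a stale name, if any */
    if (bind(fd, (struct sockaddr *)&addr, (socklen_t)len) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client side: connect a stream socket to the server named path. */
int client_connect(const char *path)
{
    struct sockaddr_un addr;
    int len = make_unix_addr(&addr, path);
    int fd;

    if (len < 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, (socklen_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Point 3: accept one pending connection and read up to n-1 bytes
 * from it into buf. Returns the byte count, or -1 on error. */
int accept_and_read(int lfd, char *buf, size_t n)
{
    int cfd = accept(lfd, NULL, NULL);
    ssize_t got;

    if (cfd < 0)
        return -1;
    got = read(cfd, buf, n - 1);
    close(cfd);
    if (got < 0)
        return -1;
    buf[got] = '\0';
    return (int)got;
}
```

A server along these lines would call serve_listen("test") and then
accept_and_read() in place of the bare select in willcrash.c, so that
the connection made by talk is actually completed before the server
exits.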

Hope that helps, and also that someone can dig up the bug fix.

Good luck.

Jim McKie    Centrum voor Wiskunde en Informatica, Amsterdam    mcvax!jim