[comp.bugs.4bsd] Kernel bug

mouse@mcgill-vision.UUCP (der Mouse) (05/11/87)

Index:	Kernel (probably sys/uipc_usrreq.c), mtXinu 4.3+NFS

Description:
	Under certain timing conditions, a connect() to a socket in
	the AF_UNIX domain whose listening process has already closed
	it (e.g. by dying; when a process dies, all its file
	descriptors get closed) will cause a panic:
	"trap type 8, code = d05904c2".

	I have no idea whether this is present in vanilla 4.3.  If so
	it is probably in a different form.  Certainly the fix I found
	is specific to mtXinu's 4.3+NFS.

Repeat-By:
	Run the following program:

		#include <stdio.h>
		#include <signal.h>
		#include <sys/types.h>
		#include <sys/socket.h>
		#include <sys/un.h>
		
		char *sname = "/tmp/foo";
		int s;
		struct sockaddr_un sun;
		int childpid;
		int u1cnt;
		
		sigusr1()
		{
		 u1cnt ++;
		}
		
		main()
		{
		 sun.sun_family = AF_UNIX;
		 strcpy(sun.sun_path,sname);
		 u1cnt = 0;
		 signal(SIGUSR1,sigusr1);
		 childpid = fork();
		 switch (childpid)
		  { case -1:
		       perror("fork");
		       exit(1);
		       break;
		    case 0:
		       child();
		       break;
		    default:
		       parent();
		       break;
		  }
		}
		
		mkill(pid,sig)
		int pid;
		int sig;
		{
		 if (pid != 1) /* insurance....don't want to KILL init! */
		  { kill(pid,sig);
		  }
		}
		
		child()
		{
		 int c;
		 int s2;
		 int suns;
		
		 unlink(sname);
		 s = socket(AF_UNIX,SOCK_STREAM,0);
		 if (bind(s,&sun,sizeof(sun)) < 0)
		  { perror("bind");
		die:
		    sleep(1);
		    mkill(getppid(),SIGKILL);
		    sleep(1);
		    unlink(sname);
		    exit(0);
		  }
		 if (listen(s,10) < 0)
		  { perror("listen");
		    goto die;
		  }
		 kill(getppid(),SIGUSR1);
		 for (c=0;c<10;c++)
		  { suns = sizeof(sun);
		    s2 = accept(s,&sun,&suns);
		    if (s2 < 0)
		     { perror("accept");
		       goto die;
		     }
		    close(s2);
		  }
		 close(s);
		 sleep(10);
		 printf("child done\n");
		 exit(0);
		}
		
		parent()
		{
		 while (u1cnt == 0)
		  { pause();
		  }
		 while (1)
		  { s = socket(AF_UNIX,SOCK_STREAM,0);
		    if (connect(s,&sun,sizeof(sun)) < 0)
		     { perror("connect");
		die:
		       sleep(1);
		       mkill(childpid,SIGKILL);
		       sleep(1);
		       unlink(sname);
		       exit(0);
		     }
		    printf("@");
		    fflush(stdout);
		    close(s);
		  }
		}

	It should print ten @ signs, after which your console prints
	messages about trap type 8 and a segmentation fault panic.  I
	am not sure, but if your file systems are really busy (in
	terms of directory lookups per second) I think the crash might
	not reproduce - try it single-user.

Fix:
	The problem appears to be that when the socket file descriptor
	is closed, the vnode is released with VN_RELE(), but nothing
	clears the v_socket member of the struct vnode.  If the
	connect() is done soon enough that the vnode is still found in
	the cache, v_socket will still be set, but it will point to a
	struct socket that has had many important fields destroyed by
	unp_detach, the AF_UNIX close routine (I think the struct
	socket will also have been freed, but that doesn't matter at
	the moment).  I changed unp_detach (in sys/uipc_usrreq.c) from

		unp_detach(unp)
			register struct unpcb *unp;
		{
			
			if (unp->unp_vnode) {
				VN_RELE(unp->unp_vnode);
				unp->unp_vnode = 0;
			}

	to

		unp_detach(unp)
			register struct unpcb *unp;
		{
			
			if (unp->unp_vnode) {
				unp->unp_vnode->v_socket = 0;
				VN_RELE(unp->unp_vnode);
				unp->unp_vnode = 0;
			}

	With this fix, our system survived the above program 7 times
	out of 7; without the fix, it crashed 2 out of 2 times (I don't
	feel like trying lots of crashes).  This is not counting the
	crash (of a different system, same software) that made me start
	looking in the first place.
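
	For anyone who wants to see the ordering problem outside the
	kernel, here is a minimal user-space sketch of the stale back
	pointer.  It is not the kernel code: all of the model_* names,
	the so_alive flag and MODEL_WOULD_PANIC are made up for
	illustration, and returning ECONNREFUSED when v_socket is clear
	is my assumption about how connect() treats a vnode with no
	socket attached.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the real kernel structures. */
struct model_socket {
	int so_alive;			/* 0 once the close routine has torn it down */
};

struct model_vnode {
	int v_count;			/* reference count; 0 = cached but unused */
	struct model_socket *v_socket;	/* back pointer to the bound socket */
};

struct model_unpcb {
	struct model_vnode *unp_vnode;
	struct model_socket *unp_socket;
};

#define MODEL_WOULD_PANIC (-1)		/* stands in for the trap type 8 panic */

/* Original ordering: the reference is dropped but v_socket is left stale. */
void model_detach_old(struct model_unpcb *unp)
{
	if (unp->unp_vnode) {
		unp->unp_vnode->v_count--;	/* models VN_RELE() */
		unp->unp_vnode = NULL;
	}
	unp->unp_socket->so_alive = 0;		/* socket torn down afterwards */
}

/* Patched ordering: clear the back pointer before dropping the reference. */
void model_detach_new(struct model_unpcb *unp)
{
	if (unp->unp_vnode) {
		unp->unp_vnode->v_socket = NULL;
		unp->unp_vnode->v_count--;	/* models VN_RELE() */
		unp->unp_vnode = NULL;
	}
	unp->unp_socket->so_alive = 0;
}

/* connect() side, for a vnode that was still found in the cache. */
int model_connect(struct model_vnode *vp)
{
	if (vp->v_socket == NULL)
		return ECONNREFUSED;		/* nobody listening (assumed behaviour) */
	if (!vp->v_socket->so_alive)
		return MODEL_WOULD_PANIC;	/* would follow a dead socket */
	return 0;
}
```

	Calling model_detach_old() and then model_connect() on the
	same vnode gives MODEL_WOULD_PANIC; with model_detach_new() the
	same call gives ECONNREFUSED instead.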

					der Mouse

				(mouse@mcgill-vision.uucp)