mouse@mcgill-vision.UUCP (der Mouse) (05/11/87)
Index:	Kernel (probably sys/uipc_usrreq.c), mtXinu 4.3+NFS

Description:
	Under certain timing conditions, trying to connect() to a socket
	in the AF_UNIX domain, when the process that was listening to the
	socket has closed the socket (eg, died - when a process dies all
	its file descriptors get closed), will cause a panic: "trap type
	8, code = d05904c2".

	I have no idea whether this is present in vanilla 4.3.  If so it
	is probably in a different form.  Certainly the fix I found is
	specific to mtXinu's 4.3+NFS.

Repeat-By:
	Run the following program:

	#include <stdio.h>
	#include <signal.h>
	#include <sys/types.h>
	#include <sys/socket.h>
	#include <sys/un.h>

	char *sname = "/tmp/foo";
	int s;
	struct sockaddr_un sun;
	int childpid;
	int u1cnt;

	sigusr1()
	{
	 u1cnt ++;
	}

	main()
	{
	 sun.sun_family = AF_UNIX;
	 strcpy(sun.sun_path,sname);
	 u1cnt = 0;
	 signal(SIGUSR1,sigusr1);
	 childpid = fork();
	 switch (childpid)
	  { case -1:
	       perror("fork");
	       exit(1);
	       break;
	    case 0:
	       child();
	       break;
	    default:
	       parent();
	       break;
	  }
	}

	mkill(pid,sig)
	int pid;
	int sig;
	{
	 if (pid != 1) /* insurance....don't want to KILL init! */
	  { kill(pid,sig);
	  }
	}

	child()
	{
	 int c;
	 int s2;
	 int suns;

	 unlink(sname);
	 s = socket(AF_UNIX,SOCK_STREAM,0);
	 if (bind(s,&sun,sizeof(sun)) < 0)
	  { perror("bind");
	die:
	    sleep(1);
	    mkill(getppid(),SIGKILL);
	    sleep(1);
	    unlink(sname);
	    exit(0);
	  }
	 if (listen(s,10) < 0)
	  { perror("listen");
	    goto die;
	  }
	 kill(getppid(),SIGUSR1);
	 for (c=0;c<10;c++)
	  { suns = sizeof(sun);
	    s2 = accept(s,&sun,&suns);
	    if (s2 < 0)
	     { perror("accept");
	       goto die;
	     }
	    close(s2);
	  }
	 close(s);
	 sleep(10);
	 printf("child done\n");
	 exit(0);
	}

	parent()
	{
	 while (u1cnt == 0)
	  { pause();
	  }
	 while (1)
	  { s = socket(AF_UNIX,SOCK_STREAM,0);
	    if (connect(s,&sun,sizeof(sun)) < 0)
	     { perror("connect");
	die:
	       sleep(1);
	       mkill(childpid,SIGKILL);
	       sleep(1);
	       unlink(sname);
	       exit(0);
	     }
	    printf("@");
	    fflush(stdout);
	    close(s);
	  }
	}

	Watch it print ten @ signs and then watch your console print
	messages about trap type 8, panic segmentation fault.
	I am not sure, but if your file systems are really busy (in terms
	of directory lookups per second) I think this might not work -
	try it single-user.

Fix:
	The problem appears to be that when the socket file descriptor is
	closed, the vnode is released with VN_RELE().  However, this does
	not clear the v_socket member of the struct vnode.  If the
	connect() is done soon enough that the vnode is found in the
	cache, v_socket will still be set, except it will be pointing to
	a struct socket that has had many important fields destroyed by
	unp_detach, the AF_UNIX close routine (I think the struct socket
	will also have been freed, but that doesn't matter at the
	moment).

	I changed unp_detach (in sys/uipc_usrreq.c) from

	unp_detach(unp)
		register struct unpcb *unp;
	{

		if (unp->unp_vnode) {
			VN_RELE(unp->unp_vnode);
			unp->unp_vnode = 0;
		}

	to

	unp_detach(unp)
		register struct unpcb *unp;
	{

		if (unp->unp_vnode) {
			unp->unp_vnode->v_socket = 0;
			VN_RELE(unp->unp_vnode);
			unp->unp_vnode = 0;
		}

	With this fix, our system survived the above program 7 times out
	of 7; without the fix, it crashed 2 out of 2 times (I don't feel
	like trying lots of crashes).  This is not counting the crash (of
	a different system, same software) that made me start looking in
	the first place.

					der Mouse

				(mouse@mcgill-vision.uucp)
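
For anyone who wants to see the stale-pointer mechanism without crashing a machine, here is a rough userspace sketch of it. Everything in it is made up for illustration: the struct layouts, the so_valid flag, vn_rele(), and the simulate() helper are stand-ins and not the real mtXinu kernel definitions. It only models the idea that the vnode outlives the socket in a cache, so the back-pointer has to be cleared before the reference is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (not the real layouts). */
struct socket { int so_valid; };                 /* 1 while the socket is usable */
struct vnode  { int v_count; struct socket *v_socket; };
struct unpcb  { struct vnode *unp_vnode; };

/* Drop one reference.  The vnode itself stays around, as in a name cache. */
static void vn_rele(struct vnode *vp)
{
    assert(vp->v_count > 0);
    vp->v_count--;
}

/* Model of the original unp_detach: v_socket is left pointing at the socket. */
static void unp_detach_old(struct unpcb *unp, struct socket *so)
{
    if (unp->unp_vnode) {
        vn_rele(unp->unp_vnode);
        unp->unp_vnode = NULL;
    }
    so->so_valid = 0;   /* the close routine tears the socket down */
}

/* Model of the patched unp_detach: clear the back-pointer first. */
static void unp_detach_fixed(struct unpcb *unp, struct socket *so)
{
    if (unp->unp_vnode) {
        unp->unp_vnode->v_socket = NULL;
        vn_rele(unp->unp_vnode);
        unp->unp_vnode = NULL;
    }
    so->so_valid = 0;
}

/* What a later connect() would see when it finds the cached vnode:
 * returns 1 if v_socket still points at a torn-down socket. */
static int lookup_finds_stale_socket(const struct vnode *vp)
{
    return vp->v_socket != NULL && !vp->v_socket->so_valid;
}

/* Run one bind/close/lookup cycle; use_fix selects the patched detach. */
static int simulate(int use_fix)
{
    struct socket so = { 1 };
    struct vnode  vn = { 1, &so };
    struct unpcb  unp = { &vn };

    if (use_fix)
        unp_detach_fixed(&unp, &so);
    else
        unp_detach_old(&unp, &so);
    return lookup_finds_stale_socket(&vn);
}
```

Calling simulate(0) models the unpatched path, where the lookup is handed a pointer to a dead socket; simulate(1) models the patched path, where the lookup sees no socket at all and a connect() could fail cleanly instead.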