naim@eecs.nwu.edu (Naim Abdullah) (07/12/88)
Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

Suppose, process A executes write(fd, "123", 3) and process B
executes write(fd, "456", 3) "concurrently". The file descriptor fd
is shared between them (the file was creat(2)'ed for writing by the
common parent of A and B). Does UNIX guarantee that the contents of
the descriptor will be "123456" or "456123" (depending on which of
A and B won the race) but never "124536" ? Does it make a difference
whether the descriptor is a pipe or a terminal or a disk file or a
tape drive or something else ?

    Naim Abdullah
    Dept. of EECS, Northwestern University
    Internet: naim@eecs.nwu.edu
    Uucp: {oddjob, chinet, gargoyle}!nucsrl!naim
bzs@bu-cs.UUCP (07/13/88)
>Do UNIX semantics guarantee that write(2) calls will be "atomic" ?
>
>Suppose, process A executes write(fd, "123", 3) and process B
>executes write(fd, "456", 3) "concurrently". The file descriptor fd
>is shared between them (the file was creat(2)'ed for writing by the
>common parent of A and B). Does UNIX guarantee that the contents of
>the descriptor will be "123456" or "456123" (depending on which of
>A and B won the race) but never "124536" ? Does it make a difference
>whether the descriptor is a pipe or a terminal or a disk file or a
>tape drive or something else ?
>
>    Naim Abdullah

There's another possibility, which is hard to illustrate but is more
like

    write(fd,"123456",6) and write(fd,"ABC",3) -> "123ABC456"

(what I'm trying to say is atomic in units of some page size or block
boundary), and yet another (vide infra).

I think the result is indeterminate (undefined). A few experiments
here on various systems came up with some even stranger results than
the above (as if the file was just overwritten, more like "123ABC" for
the [expected] 9-char example above).

Note to experimenters: try it with one buffer (single write) very
large, like 100K or more. With smallish buffers things tend to
serialize.

Anyhow, that's what file/record locking is all about. (NFS seemed to
strictly serialize things; however, I'm not sure that's guaranteed at
all, though I can see how it is more likely. Just luck, probably?)

    -Barry Shein, Boston University
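A minimal sketch of the "file/record locking" approach mentioned above, assuming a POSIX system that provides fcntl(2) advisory locks. The helper names locked_append and locked_append_demo are hypothetical, and the scheme only works if every writer cooperates by taking the same lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to fd while holding an exclusive
 * advisory lock on the whole file.  Returns 0 on success, -1 on error. */
int locked_append(int fd, const void *buf, size_t len)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };  /* l_len 0: to EOF */
    const char *p = buf;
    int rc = 0;

    assert(buf != NULL);
    if (fcntl(fd, F_SETLKW, &fl) == -1)   /* block until the lock is ours */
        return -1;

    /* Seek and write are now safe from other *cooperating* writers. */
    if (lseek(fd, 0, SEEK_END) == (off_t)-1)
        rc = -1;
    while (rc == 0 && len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            rc = -1;
            break;
        }
        p += n;                           /* handle short writes */
        len -= (size_t)n;
    }

    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        rc = -1;
    return rc;
}

/* Usage sketch: two appends to a scratch file.  Returns 0 if the file
 * then holds exactly "123456", -1 otherwise. */
int locked_append_demo(void)
{
    char path[] = "/tmp/lockdemoXXXXXX";
    char got[6];
    int fd = mkstemp(path);
    int ok;

    if (fd == -1)
        return -1;
    unlink(path);                         /* file lives until fd is closed */
    ok = locked_append(fd, "123", 3) == 0
      && locked_append(fd, "456", 3) == 0
      && pread(fd, got, sizeof got, 0) == (ssize_t)sizeof got
      && memcmp(got, "123456", sizeof got) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

In the scenario from the original question, processes A and B would each call locked_append(fd, ...) instead of write(); fcntl locks belong to the process rather than to the shared descriptor, so the two siblings do exclude each other.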
haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (07/13/88)
in article <11410005@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
> Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

in general, no.  it depends on the implementation.  use some
synchronization primitives or one byte writes only.  worse than just
mixing up data, if two processes are pounding away at a file, data may
be lost.  (see below)

> Suppose, process A executes write(fd, "123", 3) and process B
> executes write(fd, "456", 3) "concurrently". The file descriptor fd
> is shared between them (the file was creat(2)'ed for writing by the
> common parent of A and B). Does UNIX guarantee that the contents of
> the descriptor will be "123456" or "456123" (depending on which of
> A and B won the race) but never "124536" ? Does it make a difference
> whether the descriptor is a pipe or a terminal or a disk file or a
> tape drive or something else ?

some special file types may implement atomic writes.  notably, berkeley
sockets (at least under SunOS 3.5 and Ultrix 2.0) appear, empirically,
to be fully atomic.  this is probably to support "reliable" protocols
like tcp/ip.  (could someone who knows the tcp/ip protocol spec confirm
whether or not it requires atomic writes?)  as a side effect, pipes
when implemented by socketpair(2) seem atomic.

i don't know about system v streams, but version 9 streams and pipes
are non atomic, and warn about this on their manual pages (along with a
comment that a fast reader and slow writers can simulate atomicity).

there is a shell archive at the end of this article containing two
programs i have used to test the atomicity of writes.  the first,
write.c, creates two processes.  each writes n strings of A or B to
standard output, with the length and number of strings set from the
command line.  no synchronization is attempted.  the original process
writes strings of A, the child writes strings of B.
so, for example, the output of "write 5 4" could be

    AAAAABBBBBAAAAABBBBBAAAAAAAAAABBBBBBBBBB

(where 5 is the length of the writes, and 4 is the number of strings)

count.c reads the output of write.c and counts the length of each run
of identical characters (sort of like uniq -c for characters instead of
lines).  the shell script bufs translates the number of characters into
number of buffers, which will be fractional if there was a non-atomic
write.  "count | bufs 5" for the previous data gives

       5 1        A
       5 1        B
       5 1        A
       5 1        B
      10 2        A
      10 2        B

where the problem comes in is larger buffers.  using large enough
writes gives fractional numbers in the second column, on an nfs or nd
filesystem.  on a sun, i have not been able to generate a partial
record, i.e. "124536" from the original article, with a local disk.

what i consider a more serious problem occurs much more frequently than
fractional writes.  data gets dropped.  this occurs with both local and
remote file systems.  using a "write 8193 15" to a local (smd) disk
with an 8192 byte filesystem blocksize on a sun (similar results were
seen on a vax) gave

   90123 11       A
    8193 1        B
    8193 1        A
    8193 1                  <<< empty
   16386 2        A
  114702 14       B

examining the file with od showed nul (0) characters in that area.  it
takes fewer writes to get a similar result repeatedly with an nfs or nd
filesystem.

what seems to be happening is that between the time one process writes
its data and when it updates the file pointer, the other process gets
scheduled to run.  to solve this problem, one would need to add locks
or semaphores to file table entries to guarantee exclusive access to
the file pointers.  fortunately, the people who are doing (symmetric)
multiprocessor unices have to do this anyway.
paul haahr
princeton!haahr or haahr@princeton.edu

# to unbundle, sh this file
# bundled by haahr on dennis at Tue Jul 12 14:04:59 EDT 1988
# contents of bundle:
#    write.c
#    count.c
#    bufs
echo write.c >&2
sed 's/^-//' > write.c <<'end of write.c'
-#include <stdio.h>
-
-#define atoi(s) (strtol((s), (char **) 0, 0))
-#define streq(s, t) (strcmp((s), (t)) == 0)
-
-extern char *malloc();
-extern long strtol();
-extern int strcmp();
-
-int main(argc, argv)
-    int argc;
-    char *argv[];
-{
-    int pid, wpid, i, c, n, bufsize;
-    char *buf;
-
-    if (argc != 3) {
-        fprintf(stderr, "usage: %s bufsize nwrites\n", argv[0]);
-        exit(1);
-    }
-    bufsize = atoi(argv[1]);
-    n = atoi(argv[2]);
-
-    if ((pid = fork()) == -1) {
-        perror("fork");
-        exit(1);
-    }
-
-    if (pid == 0)
-        c = 'B';
-    else
-        c = 'A';
-    if ((buf = malloc(bufsize)) == NULL) {
-        perror("malloc");
-        exit(1);
-    }
-    for (i = 0; i < bufsize; i++)
-        buf[i] = c;
-
-    for (i = 0; i < n; i++)
-        if (write(1, buf, bufsize) == -1) {
-            perror("write");
-            exit(1);
-        }
-
-    if (pid != 0)
-        do
-            if ((wpid = wait((int *) 0)) == -1) {
-                perror("wait");
-                exit(1);
-            }
-        while (wpid != pid);
-
-    return 0;
-}
end of write.c
echo count.c >&2
sed 's/^-//' > count.c <<'end of count.c'
-#include <stdio.h>
-
-main()
-{
-    /* lastc starts as EOF so the first comparison is well defined */
-    int c, lastc = EOF, n = 0;
-    do
-        if ((c = getchar()) == lastc)
-            n++;
-        else {
-            if (n > 0) {
-                printf("%6d %c\n", n, lastc);
-            }
-            n = 1;
-            lastc = c;
-        }
-    while (c != EOF);
-}
end of count.c
echo bufs >&2
sed 's/^-//' > bufs <<'end of bufs'
-#! /bin/sh
-n=$1
-shift
-awk 'NF > 0 { printf "%8d %-8.3g %s\n", $1, $1/'$n', $2 }' $*
end of bufs
chmod +x bufs
chris@mimsy.UUCP (Chris Torek) (07/13/88)
In article <23801@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>I think the result is indeterminate (undefined), a few experiments
>here on various systems came up with some even stranger results ....

It is certainly not well-defined.  4BSD makes writes to regular files
and block special files atomic by locking the inode across the write()
call.  Character devices like terminals tend to be written atomically
only when the number of characters written fits in a cblock.  Appending
by lseek/write (rather than FAPPEND) has a race between the lseek and
the write, and so forth.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
karish@denali.stanford.edu (Chuck Karish) (07/13/88)
in article <11410005@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
> Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

If the write() is to a pipe or to a FIFO, and the request is for
[PIPE_BUF] bytes or fewer, the write() is guaranteed to be atomic by
the SVID and by the POSIX 1003.1 draft standard.  For other write()s,
the behavior is undefined; you take your chances.

If you absolutely, positively need atomic write()s, cram your I/O
through a pipe or set up your own locking scheme.

    Chuck Karish
    ARPA:   karish@denali.stanford.edu
    BITNET: karish%denali@forsythe.stanford.edu
    UUCP:   {decvax,hplabs!hpda}!mindcrf!karish
    USPS:   1825 California St. #5   Mountain View, CA 94041
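The pipe guarantee described above can be exercised with a small experiment in the spirit of write.c. This is only a sketch, assuming a POSIX system; RECLEN, pipe_writer and count_torn_pipe_records are hypothetical names chosen for illustration. Two children write fixed-size records of 'A' or 'B' to one pipe, one write() per record, and the parent counts records that contain a mix of both letters.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>     /* PIPE_BUF */
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RECLEN 64       /* illustrative record size, far below PIPE_BUF */

/* Child body: emit nrec records of RECLEN copies of c, one write() each. */
static void pipe_writer(int fd, char c, int nrec)
{
    char rec[RECLEN];
    int i;

    memset(rec, c, sizeof rec);
    for (i = 0; i < nrec; i++)
        if (write(fd, rec, sizeof rec) != (ssize_t)sizeof rec)
            _exit(1);
    _exit(0);
}

/* Fork two writers onto one pipe and read everything back.  Returns the
 * number of records that are short or mix 'A' and 'B', or -1 on error. */
int count_torn_pipe_records(int nrec)
{
    int p[2], k, torn = 0, total = 0;
    char rec[RECLEN];

    assert(RECLEN <= PIPE_BUF);   /* the guarantee only covers this case */
    if (pipe(p) == -1)
        return -1;
    for (k = 0; k < 2; k++) {
        pid_t pid = fork();
        if (pid == -1) {
            total = -1;           /* remember the failure, then clean up */
            break;
        }
        if (pid == 0) {
            close(p[0]);
            pipe_writer(p[1], k == 0 ? 'A' : 'B', nrec);  /* never returns */
        }
    }
    close(p[1]);                  /* reader sees EOF once writers finish */

    while (total >= 0) {
        size_t got = 0, i;
        ssize_t n;

        /* Gather exactly one record; read() may return less than asked. */
        while (got < sizeof rec &&
               (n = read(p[0], rec + got, sizeof rec - got)) > 0)
            got += (size_t)n;
        if (got == 0)
            break;                /* end of data */
        total++;
        if (got < sizeof rec) {   /* short final record counts as torn */
            torn++;
            break;
        }
        for (i = 1; i < sizeof rec; i++)
            if (rec[i] != rec[0]) {
                torn++;
                break;
            }
    }
    close(p[0]);
    while (wait(NULL) > 0)        /* reap both children */
        ;
    return total == 2 * nrec ? torn : -1;
}
```

On a system that honors the PIPE_BUF rule, count_torn_pipe_records(200) should report no torn records; raising RECLEN above PIPE_BUF (and removing the assertion) is the interesting variation, since the guarantee no longer applies there.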
chris@mimsy.UUCP (Chris Torek) (07/13/88)
In article <3247@phoenix.Princeton.EDU> haahr@phoenix.Princeton.EDU
(Paul Gluckauf Haahr) writes:
>worse than just mixing up data, if two processes are pounding away
>at a file, it data may be lost. (see below)

This is due to a bug in 4.2BSD (and probably others).  Looking at the
4.2BSD code, it seems that uio->uio_offset is not set unless FAPPEND,
instead relying on the value set before the ILOCK() (all in
/sys/sys/sys_inode.c ino_rw()).  Two `simultaneous' writes will cause
the second to block in ILOCK(); meanwhile, fp->f_offset will grow from
the first writer, without uio->uio_offset growing for the second writer
when the first finally unlocks the inode.  The second write then goes
to the stale offset.

>some special file types may implement atomic writes. notably, berkeley
>sockets (at least under SunOS 3.5 and Ultrix 2.0) appear, empirically,
>to be fully atomic. this is probably to support "reliable" protocols
>like tcp/ip.

I am not sure whether this is by design; it is due to the FIFO
behaviour of sleep() (see /sys/sys/uipc_socket.c sosend(), where more
is being sent than fits: if SS_NBIO nonblocking i/o, EWOULDBLOCK or
short write, else unlock socket buffer and sleep on it).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
metro@asi.UUCP (Metro T. Sauper) (07/13/88)
From article <23801@bu-cs.BU.EDU>, by bzs@bu-cs.BU.EDU (Barry Shein):
>
>>Do UNIX semantics guarantee that write(2) calls will be "atomic" ?
>>
>>    Naim Abdullah
>
> I think the result is indeterminate (undefined), .......
>
>    -Barry Shein, Boston University

I believe that it also depends on the mode you opened the file with.
My Programmer's Reference Manual for UNIX System V r3.1 indicates that
if you open a file with mode "a" (append), you are guaranteed to write
to the current end of the file whether or not someone else has appended
to the file.
-- 
Metro T. Sauper, Jr.                     Assessment Systems, Inc.
Director, Remote Systems Development     210 South Fourth Street
(215) 592-8900     ..!asi!metro          Philadelphia, PA 19106
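A small sketch of the append-mode behaviour described above, using stdio's fopen(path, "a"). It assumes a POSIX-like system; put_record and append_mode_demo are hypothetical names. Two independent streams are opened on the same scratch file and take turns writing, with an fflush() after each record so that each small record reaches the kernel as its own write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one record and flush it at once.  For records smaller than the
 * stdio buffer this should result in a single underlying write. */
static int put_record(FILE *fp, const char *rec)
{
    if (fputs(rec, fp) == EOF)
        return -1;
    return fflush(fp) == EOF ? -1 : 0;
}

/* Open the same scratch file twice in append mode and alternate writers.
 * Returns 0 if neither stream overwrote the other's data, -1 otherwise. */
int append_mode_demo(void)
{
    char path[] = "/tmp/appenddemoXXXXXX";
    char got[16] = "";
    FILE *a, *b, *in;
    int fd = mkstemp(path);
    int ok;

    if (fd == -1)
        return -1;
    close(fd);

    a = fopen(path, "a");
    b = fopen(path, "a");
    if (a == NULL || b == NULL) {
        if (a != NULL)
            fclose(a);
        if (b != NULL)
            fclose(b);
        unlink(path);
        return -1;
    }

    /* Without append mode, b would start at offset 0 and clobber "123". */
    ok = put_record(a, "123") == 0
      && put_record(b, "456") == 0
      && put_record(a, "789") == 0;
    fclose(a);
    fclose(b);

    in = fopen(path, "r");
    if (in == NULL || fgets(got, sizeof got, in) == NULL)
        ok = 0;
    if (in != NULL)
        fclose(in);
    unlink(path);
    return ok && strcmp(got, "123456789") == 0 ? 0 : -1;
}
```

The demo runs in a single process to keep the result deterministic; it shows that each stream writes at the current end of file, not that large or unflushed records are safe from interleaving.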
boyd@basser.oz (Boyd Roberts) (07/14/88)
Ok guys, enough is enough.

Regular file writes are atomic, courtesy of plock().

Pipe writes are atomic, courtesy of plock(), provided the size of the
write is less than the MAXIMUM size (not the available size) of the
data that can be held in the pipe.

Directory _writes_ (yes, it's a side effect of namei()) are atomic,
courtesy of plock().

Character special writes are ``atomic'' on cblock boundaries for tty
style devices.  That is, not truly atomic.

Sockets, ether-packets, etc _should_ be atomic in the expected way.
Read your protocol _spec_ right now!!

Stream writes (from memory... my memory, not RAM) should be atomic
due to allocb() allocating a large enough block to contain the data.
A bit like pipes, really.

Block special writes are atomic on block boundaries.  You are
_supposed_ to do ``block'' transfers.

Yes, there are races, but there are atomic writes.

A word of warning, do not reply or post anything containing any of the
following ``words'':

    NFS
    RFS
    POSIX
    X/OPEN

We are discussing UNIX file system semantics.

Boyd Roberts
boyd@basser.cs.su.oz
boyd@necisa.necisa.oz

``When the going gets weird, the weird turn pro...''
naim@eecs.nwu.edu (Naim Abdullah) (07/14/88)
Chris Torek pointed out that the fact that concurrent writes may
result in loss of data, is due to a bug in 4.2BSD. The bug persists
in 4.3BSD (at least in Mt. Xinu's 4.3+NFS).

However, I was able to solve the original problem by using O_APPEND
when I open(2)'ed the output file. This seems to result in atomic
writes. Thanks to haynes@ucscc.ucsc.edu for this suggestion. He also
pointed out that the 4.3bsd login(1) opens wtmp using O_APPEND, as
login faces the same problem of multiple concurrent writers to the
same wtmp file (when many people are logging in and out at the same
time).

I just checked our System V rel 3.1 sources, and the System V login
fseeks before fwriting to wtmp, so I imagine it will suffer from this
problem too.

    Naim Abdullah
    Dept. of EECS, Northwestern University
    Internet: naim@eecs.nwu.edu
    Uucp: {oddjob, chinet, gargoyle}!nucsrl!naim
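The O_APPEND workaround can be checked with a variation on the earlier write.c experiment. The following is a sketch only, assuming a POSIX system and a local filesystem (remote filesystems such as NFS may behave differently); RECLEN, append_writer and append_writers_lose_nothing are hypothetical names. Parent and children share one descriptor opened with O_APPEND, and afterwards the parent counts the bytes from each writer to see whether anything was dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RECLEN 64       /* illustrative record size */

/* Child body: append nrec records of RECLEN copies of c, then exit. */
static void append_writer(int fd, char c, int nrec)
{
    char rec[RECLEN];
    int i;

    memset(rec, c, sizeof rec);
    for (i = 0; i < nrec; i++)
        if (write(fd, rec, sizeof rec) != (ssize_t)sizeof rec)
            _exit(1);
    _exit(0);
}

/* Two children write through one shared O_APPEND descriptor.  Returns 0
 * if every byte from both writers is present in the file, -1 otherwise. */
int append_writers_lose_nothing(int nrec)
{
    char path[] = "/tmp/oappendXXXXXX";
    char buf[RECLEN];
    long na = 0, nb = 0, want = (long)nrec * RECLEN;
    ssize_t n;
    int fd, k;

    assert(nrec >= 0);
    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    close(fd);
    fd = open(path, O_RDWR | O_APPEND);   /* every write goes to EOF */
    unlink(path);
    if (fd == -1)
        return -1;

    for (k = 0; k < 2; k++) {
        pid_t pid = fork();
        if (pid == -1)
            break;                        /* counts below will not match */
        if (pid == 0)
            append_writer(fd, k == 0 ? 'A' : 'B', nrec);  /* never returns */
    }
    while (wait(NULL) > 0)                /* let both writers finish */
        ;

    if (lseek(fd, 0, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ssize_t i;

        for (i = 0; i < n; i++) {
            if (buf[i] == 'A')
                na++;
            else if (buf[i] == 'B')
                nb++;
        }
    }
    close(fd);
    return (na == want && nb == want) ? 0 : -1;
}
```

The check only counts bytes, so it detects dropped or overwritten data (the NUL-filled gaps reported earlier in the thread) but says nothing about whether individual records stayed contiguous.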
chris@mimsy.UUCP (Chris Torek) (07/14/88)
In article <11410006@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
>Chris Torek pointed out that the fact that concurrent writes may
>result in loss of data, is due to a bug in 4.2BSD. The bug persists
>in 4.3BSD (at least in Mt. Xinu's 4.3+NFS).

Sure enough, it does.  (I should have checked my own RCS files...)
I copied the fix for this in January.  Here it is:

[/sys/sys/sys_inode.c]
*** /tmp/,RCSt1015268   Thu Jul 14 11:08:14 1988
--- /tmp/,RCSt2015268   Thu Jul 14 11:08:16 1988
***************
*** 36,49 ****
  {
      register struct inode *ip = (struct inode *)fp->f_data;
!     int error;
  
!     if ((ip->i_mode&IFMT) == IFREG) {
          ILOCK(ip);
!         if (fp->f_flag&FAPPEND && rw == UIO_WRITE)
!             uio->uio_offset = fp->f_offset = ip->i_size;
!         error = rwip(ip, uio, rw);
          IUNLOCK(ip);
-     } else
-         error = rwip(ip, uio, rw);
      return (error);
  }
--- 36,53 ----
  {
      register struct inode *ip = (struct inode *)fp->f_data;
!     int count, error;
  
!     if ((ip->i_mode&IFMT) != IFCHR)
          ILOCK(ip);
!     if ((ip->i_mode&IFMT) == IFREG &&
!         (fp->f_flag&FAPPEND) &&
!         rw == UIO_WRITE)
!         fp->f_offset = ip->i_size;
!     uio->uio_offset = fp->f_offset;
!     count = uio->uio_resid;
!     error = rwip(ip, uio, rw);
!     fp->f_offset += count - uio->uio_resid;
!     if ((ip->i_mode&IFMT) != IFCHR)
          IUNLOCK(ip);
      return (error);
  }
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris