[comp.unix.wizards] Is write

naim@eecs.nwu.edu (Naim Abdullah) (07/12/88)

Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

Suppose, process A executes write(fd, "123", 3) and process B
executes write(fd, "456", 3) "concurrently". The file descriptor fd
is shared between them (the file was creat(2)'ed for writing by the
common parent of A and B). Does UNIX guarantee that the contents of
the descriptor will be "123456" or "456123" (depending on which of
A and B won the race) but never "124536" ? Does it make a difference
whether the descriptor is a pipe or a terminal or a disk file or a
tape drive or something else ?

		      Naim Abdullah
		      Dept. of EECS,
		      Northwestern University

		      Internet: naim@eecs.nwu.edu
		      Uucp: {oddjob, chinet, gargoyle}!nucsrl!naim

bzs@bu-cs.UUCP (07/13/88)

>Do UNIX semantics guarantee that write(2) calls will be "atomic" ?
>
>Suppose, process A executes write(fd, "123", 3) and process B
>executes write(fd, "456", 3) "concurrently". The file descriptor fd
>is shared between them (the file was creat(2)'ed for writing by the
>common parent of A and B). Does UNIX guarantee that the contents of
>the descriptor will be "123456" or "456123" (depending on which of
>A and B won the race) but never "124536" ? Does it make a difference
>whether the descriptor is a pipe or a terminal or a disk file or a
>tape drive or something else ?
>
>		      Naim Abdullah

There's another possibility which is hard to illustrate but more like
write(fd,"123456",6) and write(fd,"ABC",3) -> "123ABC456" (what I'm
trying to say is atomic in units of some page size or block boundary)
and yet another (vide infra).

I think the result is indeterminate (undefined), a few experiments
here on various systems came up with some even stranger results than
the above (as if the file was just overwritten, more like "123ABC" for
the [expected] 9-char example above.)

Note to experimenters: try it with one buffer (single write) very
large, like 100K or more. With smallish buffers things tend to
serialize.

Anyhow, that's what file/record locking is all about.  (NFS seemed to
strictly serialize things; I'm not sure that's guaranteed at all,
though I can see how it is more likely.  Just luck, probably.)

	-Barry Shein, Boston University

haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (07/13/88)

in article <11410005@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
> Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

in general, no.  it depends on the implementation.
use some synchronization primitives or one byte writes only.
worse than just mixing up data, if two processes are pounding away
at a file, data may be lost.  (see below)

> Suppose, process A executes write(fd, "123", 3) and process B
> executes write(fd, "456", 3) "concurrently". The file descriptor fd
> is shared between them (the file was creat(2)'ed for writing by the
> common parent of A and B). Does UNIX guarantee that the contents of
> the descriptor will be "123456" or "456123" (depending on which of
> A and B won the race) but never "124536" ? Does it make a difference
> whether the descriptor is a pipe or a terminal or a disk file or a
> tape drive or something else ?

some special file types may implement atomic writes.  notably, berkeley
sockets (at least under SunOS 3.5 and Ultrix 2.0) appear, empirically,
to be fully atomic.  this is probably to support "reliable" protocols
like tcp/ip.  (could someone who knows the tcp/ip protocol spec confirm
whether or not it requires atomic writes?)  as a side effect, pipes
when implemented by socketpair(2) seem atomic.

i don't know about system v streams, but version 9 streams and pipes
are non atomic, and warn about this on their manual pages (along with a
comment that a fast reader and slow writers can simulate atomicity).

there is a shell archive at the end of this article containing two
programs i have used to test the atomicity of writes.  the first,
write.c, creates two processes.  each writes n strings of A or B to
standard output, with the length and number of strings set from the
command line.  no synchronization is attempted.  the original process
writes strings of A, the child writes strings of B.  so, for example,
the output of "write 5 4" could be
	AAAAABBBBBAAAAABBBBBAAAAAAAAAABBBBBBBBBB
(where 5 is the length of the writes, and 4 is the number of strings)

count.c reads the output of write.c and counts the number of each
character (sort of like uniq -c for characters instead of lines).  the
shell script bufs translates the number of characters into number of
buffers, which will be fractional if there was a non-atomic write.
"count | bufs 5" for the previous data gives
       5   1        A
       5   1        B
       5   1        A
       5   1        B
      10   2        A
      10   2        B

where the problem comes in is larger buffers.  using large enough
writes gives fractional numbers in the second column, on an nfs or
nd filesystem.  on a sun, i have not been able to generate a partial
record, i.e. "124536" from the original article, with a local disk.

what i consider a more serious problem occurs much more frequently than
fractional writes.  data gets dropped.  this occurs with both local
and remote file systems.  using a "write 8193 15" to a local (smd) disk
with an 8192 byte filesystem blocksize on a sun (similar results were
seen on a vax) gave
   90123   11       A
    8193   1        B
    8193   1        A
    8193   1        	<<< empty
   16386   2        A
  114702   14       B
examining the file with od showed nul (0) characters in that area.  it
takes fewer writes to get a similar result repeatedly with an nfs or nd
filesystem.  what seems to be happening is that between the time one
process writes its data and when it updates the file pointer, the other
process gets scheduled to run.

to solve this problem, one would need to add locks or semaphores to
file table entries to guarantee exclusive access to the file pointers.
fortunately, the people who are doing (symmetric) multiprocessor unices
have to do this anyway.

paul haahr
princeton!haahr or haahr@princeton.edu

# to unbundle, sh this file
# bundled by haahr on dennis at Tue Jul 12 14:04:59 EDT 1988
# contents of bundle:
#	write.c
#	count.c
#	bufs
echo write.c >&2
sed 's/^-//' > write.c <<'end of write.c'
-#include <stdio.h>
-
-#define atoi(s)	(strtol((s), (char **) 0, 0))
-#define	streq(s, t)	(strcmp((s), (t)) == 0)
-
-extern char *malloc();
-extern long strtol();
-extern int strcmp();
-
-int main(argc, argv)
-	int argc;
-	char *argv[];
-{
-	int pid, wpid, i, c, n, bufsize;
-	char *buf;
-
-	if (argc != 3) {
-		fprintf(stderr, "usage: %s bufsize nwrites\n", argv[0]);
-		exit(1);
-	}
-	bufsize	= atoi(argv[1]);
-	n	= atoi(argv[2]);
-
-	if ((pid = fork()) == -1) {
-		perror("fork");
-		exit(1);
-	}
-
-	if (pid == 0)
-		c = 'B';
-	else
-		c = 'A';
-	if ((buf = malloc(bufsize)) == NULL) {
-		perror("malloc");
-		exit(1);
-	}
-	for (i = 0; i < bufsize; i++)
-		buf[i] = c;
-
-	for (i = 0; i < n; i++)
-		if (write(1, buf, bufsize) == -1) {
-			perror("write");
-			exit(1);
-		}
-
-	if (pid != 0)
-		do
-			if ((wpid = wait((int *) 0)) == -1) {
-				perror("wait");
-				exit(1);
-			}
-		while (wpid != pid);
-
-	return 0;
-}
end of write.c
echo count.c >&2
sed 's/^-//' > count.c <<'end of count.c'
-#include <stdio.h>
-
-main()
-{
-	int c, lastc = EOF, n = 0;	/* lastc must start defined */
-	do
-		if ((c = getchar()) == lastc)
-			n++;
-		else {
-			if (n > 0) {
-				printf("%6d %c\n", n, lastc);
-			}
-			n = 1;
-			lastc = c;
-		}
-	while (c != EOF);
-}
end of count.c
echo bufs >&2
sed 's/^-//' > bufs <<'end of bufs'
-#! /bin/sh
-n=$1
-shift
-awk 'NF > 0 { printf "%8d   %-8.3g %s\n", $1, $1/'$n', $2 }' $*
end of bufs
chmod +x bufs

chris@mimsy.UUCP (Chris Torek) (07/13/88)

In article <23801@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>I think the result is indeterminate (undefined), a few experiments
>here on various systems came up with some even stranger results ....

It is certainly not well-defined.  4BSD makes writes to regular files
and block special files atomic by locking the inode across the write()
call.  Character devices like terminals tend to be written atomically
only when the number of characters written fits in a cblock.  Appending
by lseek/write (rather than FAPPEND) has a race between the lseek and
the write, and so forth.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

karish@denali.stanford.edu (Chuck Karish) (07/13/88)

in article <11410005@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
> Do UNIX semantics guarantee that write(2) calls will be "atomic" ?

	If the write() is to a pipe or to a FIFO, and the request is for
	[PIPE_BUF] bytes or fewer, the write() is guaranteed to be
	atomic by the SVID and by the POSIX 1003.1 draft standard.
	For other write()s, the behavior is undefined; you take your
	chances.  If you absolutely, positively need atomic write()s,
	cram your I/O through a pipe or set up your own locking scheme.

Chuck Karish	ARPA:	karish@denali.stanford.edu
		BITNET:	karish%denali@forsythe.stanford.edu
		UUCP:	{decvax,hplabs!hpda}!mindcrf!karish
		USPS:	1825 California St. #5   Mountain View, CA 94041

chris@mimsy.UUCP (Chris Torek) (07/13/88)

In article <3247@phoenix.Princeton.EDU> haahr@phoenix.Princeton.EDU
(Paul Gluckauf Haahr) writes:
>worse than just mixing up data, if two processes are pounding away
>at a file, data may be lost.  (see below)

This is due to a bug in 4.2BSD (and probably others).  Looking at
the 4.2BSD code, it seems that uio->uio_offset is not set
unless FAPPEND, instead relying on the value set before the ILOCK()
(all in /sys/sys/sys_inode.c ino_rw()).  Two `simultaneous' writes
will cause the second to block in ILOCK(); meanwhile, fp->f_offset
will grow from the first writer, without uio->uio_offset growing
for the second writer when the first finally unlocks the inode.

>some special file types may implement atomic writes.  notably, berkeley
>sockets (at least under SunOS 3.5 and Ultrix 2.0) appear, empirically,
>to be fully atomic.  this is probably to support "reliable" protocols
>like tcp/ip.

I am not sure whether this is by design; it is due to the FIFO
behaviour of sleep() (see /sys/sys/uipc_socket.c sosend(), where more
is being sent than fits: if SS_NBIO nonblocking i/o, EWOULDBLOCK or
short write, else unlock socket buffer and sleep on it).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

metro@asi.UUCP (Metro T. Sauper) (07/13/88)

From article <23801@bu-cs.BU.EDU>, by bzs@bu-cs.BU.EDU (Barry Shein):
> 
>>Do UNIX semantics guarantee that write(2) calls will be "atomic" ?
>>
>>		      Naim Abdullah
> 
> I think the result is indeterminate (undefined), .......
> 
> 	-Barry Shein, Boston University

I believe that it also depends on the mode you opened the file with.

My Programmer's Reference Manual for UNIX System V r3.1 indicates that if
you open a file with mode "a" (append) you are guaranteed to write to the
current end of the file, whether or not someone else has appended to it.


-- 
Metro T. Sauper, Jr.                              Assessment Systems, Inc.
Director, Remote Systems Development              210 South Fourth Street
(215) 592-8900                 ..!asi!metro       Philadelphia, PA 19106

boyd@basser.oz (Boyd Roberts) (07/14/88)

Ok guys, enough is enough.  Regular file writes are atomic,
courtesy of plock().  Pipe writes are atomic, courtesy of
plock(), provided the size of the write is less than the
MAXIMUM size (not the available size) of the data that
can be held in the pipe.

Directory _writes_ (yes, it's a side effect of namei())
are atomic, courtesy of plock().

Character special writes are ``atomic'' on cblock boundaries
for tty style devices.  That is, not truly atomic.

Sockets, ether-packets, etc _should_ be atomic in the expected way.
Read your protocol _spec_ right now!!

Stream writes (from memory... my memory, not RAM) should
be atomic due to allocb() allocating a large enough block
to contain the data.   A bit like pipes, really.

Block special writes are atomic on block boundaries.  You
are _supposed_ to do ``block'' transfers.

Yes, there are races, but there are atomic writes.

A word of warning, do not reply or post anything containing
any of the following ``words'':

	   NFS
	   RFS
	   POSIX
	   X/OPEN

We are discussing UNIX file system semantics.


Boyd Roberts			boyd@basser.cs.su.oz
				boyd@necisa.necisa.oz

``When the going gets weird, the weird turn pro...''

naim@eecs.nwu.edu (Naim Abdullah) (07/14/88)

Chris Torek pointed out that the fact that concurrent writes may
result in loss of data, is due to a bug in 4.2BSD. The bug persists
in 4.3BSD (at least in Mt. Xinu's 4.3+NFS). 

However, I was able to solve the original problem by using O_APPEND
when I open(2)'ed the output file. This seems to result in atomic
writes. Thanks to haynes@ucscc.ucsc.edu for this suggestion. He also
pointed out that the 4.3bsd login(1) opens wtmp using O_APPEND as login faces
the same problem of multiple concurrent writers to the same wtmp
file (when many people are logging in and out at the same time).
I just checked our System V rel 3.1 sources and the system V login
fseeks before fwriting to wtmp so I imagine it will suffer from this
problem too.


		      Naim Abdullah
		      Dept. of EECS,
		      Northwestern University

		      Internet: naim@eecs.nwu.edu
		      Uucp: {oddjob, chinet, gargoyle}!nucsrl!naim

chris@mimsy.UUCP (Chris Torek) (07/14/88)

In article <11410006@eecs.nwu.edu> naim@eecs.nwu.edu (Naim Abdullah) writes:
>Chris Torek pointed out that the fact that concurrent writes may
>result in loss of data, is due to a bug in 4.2BSD. The bug persists
>in 4.3BSD (at least in Mt. Xinu's 4.3+NFS). 

Sure enough, it does.  (I should have checked my own RCS files...)  I
copied the fix for this in January.  Here it is:

[/sys/sys/sys_inode.c]
*** /tmp/,RCSt1015268	Thu Jul 14 11:08:14 1988
--- /tmp/,RCSt2015268	Thu Jul 14 11:08:16 1988
***************
*** 36,49 ****
  {
  	register struct inode *ip = (struct inode *)fp->f_data;
! 	int error;
  
! 	if ((ip->i_mode&IFMT) == IFREG) {
  		ILOCK(ip);
! 		if (fp->f_flag&FAPPEND && rw == UIO_WRITE)
! 			uio->uio_offset = fp->f_offset = ip->i_size;
! 		error = rwip(ip, uio, rw);
  		IUNLOCK(ip);
- 	} else
- 		error = rwip(ip, uio, rw);
  	return (error);
  }
--- 36,53 ----
  {
  	register struct inode *ip = (struct inode *)fp->f_data;
! 	int count, error;
  
! 	if ((ip->i_mode&IFMT) != IFCHR)
  		ILOCK(ip);
! 	if ((ip->i_mode&IFMT) == IFREG &&
! 	    (fp->f_flag&FAPPEND) &&
! 	    rw == UIO_WRITE)
! 		fp->f_offset = ip->i_size;
! 	uio->uio_offset = fp->f_offset;
! 	count = uio->uio_resid;
! 	error = rwip(ip, uio, rw);
! 	fp->f_offset += count - uio->uio_resid;
! 	if ((ip->i_mode&IFMT) != IFCHR)
  		IUNLOCK(ip);
  	return (error);
  }
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris