[comp.bugs.4bsd] concurrent write

jas@rtech.UUCP (03/04/87)

Has anyone ever encountered this one before?

The small program below forks, and then both parent and child
write the integers from 0 to 20000, one to a line, to stdout.
The parent prefixes each integer with 'P', the child with 'C'.
A separate write(2) call is used for each line, so stdio buffering
doesn't figure in here.

What we should all expect to see in the output file (remember, parent
and child share a seek pointer) is some interleaving of the parent's
output lines and the child's.  But on many 4.2-derived systems (see below),
about 20 of the 40,000 total write(2) calls result in nulls being
written to the file, instead of the data pointed at in the write(2) call.

Repeat-by:  Run the program below, directing stdout into a file.
Check the output file for nulls.

This bug appears to exist only on 4.2-derived systems.  So far, I have
found it on the following machines:

Microvax running Ultrix
CCI Power-6 running CCI's 4.2bsd port
CCI Power-6 running CCI's System V (internally, it's still derived from 4.2)
Sequent Balance/whatever
Pyramid 90x
Sun 3/whatever

The bug is NOT on:

AT&T 3B15 and 3B20 running you-know-what

I think I'm also running into a variant of this problem involving
spurious nulls being written to a pipe when a signal occurs at just
the wrong time, and another pipe write is done in the signal handler.
I haven't been able to duplicate that one (yet) in a simple test case, though.

Any of you kernel types care to dig into this one?  Otherwise, I'll have
to, sooner or later.

------------------------ BEGIN TEST PROGRAM -------------------------------
/*
** [Editor's note: includes added so the program compiles cleanly
** on modern systems; the original relied on K&R implicit declarations.]
*/
#include <stdio.h>	/* sprintf, perror */
#include <string.h>	/* strlen */
/*
** Do concurrent write(2) calls to the same file; on lots of
** 4.2-derived systems, bad data shows up on the file.
*/

main()

{
    register int seqno = 0;
    register int len;
    register int pid;
    char buf[32];

    if ( ( pid = fork() ) < 0 )
    {
	perror( "fork" );
	exit( 1 );
    }
    for ( seqno = 0; seqno != 20000; ++seqno )
    {
	sprintf( buf, "%c%d\n", pid ? 'P' : 'C', seqno );
	len = strlen( buf );
	if ( write( 1, buf, len ) != len )
	    perror( "write" );
    }
    exit( 0 );
}
------------------------ END TEST PROGRAM -------------------------
-- 
Jim Shankland
 ..!ihnp4!cpsc6a!\
                  rtech!jas
..!ucbvax!mtxinu!/

guy%gorodish@Sun.COM (Guy Harris) (03/06/87)

>This bug appears to exist only on 4.2-derived systems.

Well, I don't know about that.  You see, it's like this:

Process A does a "write" call.  It grabs the current value of the
file pointer and uses it as the write offset.  It then locks the
inode and goes in to write stuff.  The write requires a new block to
be allocated.  This may require I/O to be done; assume it does.  The
process blocks waiting for the I/O to complete, and process B gets
scheduled.

Since process A's "write" hasn't finished, the file pointer has NOT
been updated, so process B grabs the same offset value that process A
got.  It can't write yet, though, because the inode is locked.  So it waits.

Process A now finishes its I/O and finishes the "write".  It unlocks
the inode and updates the file pointer by adding the number of bytes
it wrote.

Now assume that process A gives up the processor as soon as it returns from
the kernel, and process B gets the processor.  It now proceeds to
write *its* data *on top of* the data that process A wrote.   It
unlocks the inode, and returns, adding the number of bytes *it* wrote
to the file pointer.  Thus, the file pointer moves by the sum of the
number of bytes processes A and B wrote.

However, only the maximum of the two byte counts was actually written
to the file.  The file pointer now points some number of bytes *past*
the last byte written; the next "write" will write at that location,
leaving behind a hole filled with - you got it - zeroes.

This is borne out by

	1) the fact that in a test case I ran (the test program was
	   modified so that the parent counted *down* rather than *up*,
	   so that the parent and child would be more likely to be writing
	   different numbers of bytes), it clearly looked like the two
	   processes both tried to write a record to the *same*
	   location in the file - a location that started on a
	   512-byte boundary - and that the zeroes followed this
	   scrambled record

and

	2) the fact that when I changed the program to put the file
	   descriptor in forced-append mode (so that the writes
	   *never* overlap) the problem went away.

I don't see any obvious reason why this *couldn't* happen on any
UNIX system that doesn't lock the file table entry while a write is
in progress, and no system I've worked with locks it.  It may be that
due to the vagaries of the scheduler, and the amount of I/O done when
extending a file in small chunks, and things like that, it's *less
likely* to happen on a system using the V7 file system, but I don't
see that it's impossible on such a system.

In short, the problem is that UNIX has never been able to guarantee
that the file pointer is always valid; it's invalid while an I/O
operation is "in progress", but nothing prevents a process from using
the file pointer's value while it isn't valid.  The solution is
something like "use file locking" or "use forced append mode" or "use
something else that will keep a process from using the file pointer
value while a 'write' is in progress," assuming you can arrange that.

>I think I'm also running into a variant of this problem involving
>spurious nulls being written to a pipe when a signal occurs at just
>the wrong time, and another pipe write is done in the signal handler.

Not likely in 4.2BSD, since pipes don't go through the file system,
but go through the socket code.

edler@cmcl2.UUCP (Jan Edler) (03/10/87)

Our version of UNIX has locked the file table entry on every write
for several years now.  The only time it doesn't is when writing
to a "slow" device, like a terminal (and such devices don't usually
maintain the notion of "file position" anyway).  Nulls do not appear
in the output file when running the posted test program.

I don't see any really good reason for not handling this case correctly;
there is some overhead in getting the extra lock, but that doesn't
seem like a good enough reason to me.

Jan Edler
New York University, Ultracomputer Project
edler@nyu
cmcl2!edler

hutch@sdcsvax.UUCP (03/11/87)

<>
I presume O_APPEND on open?  If not, then life will get ugly.  O_APPEND
makes things work *even* with multiple things writing to the same
file, or at least it has worked for me so far.

-- 
    Jim Hutchison   		UUCP:	{dcdwest,ucbvax}!sdcsvax!hutch
		    		ARPA:	Hutch@sdcsvax.ucsd.edu
2049 6d61 7320 6c65 2066 6572 7270 7365 6e65 6974 676e 6920 206e 6874 7369 6120
7472 6369 656c 202c 2049 6572 7270 7365 6e65 2074 6e6f 796c 6d20 7379 6c65 2e66

guy@gorodish.UUCP (03/12/87)

>I presume O_APPEND on open?

Or when the program starts; turning on O_APPEND with "fcntl" does the
same job, and is the only way to do it if you're writing to the
standard output (since you aren't the one who opened it).  Yes,
that's what I meant by "forced-append mode".

henry@utzoo.UUCP (Henry Spencer) (03/16/87)

> I don't see any obvious reason why this *couldn't* happen on any
> UNIX system that didn't lock the file table entry while a write was
> in progress...  It may be that ... it's *less
> likely* to happen on a system using the V7 file system, but I don't
> see that it's impossible on such a system.

Not impossible at all.  Running the posted test program on utzoo (a vanilla
V7), some NULs show up.  Guy's explanation sounds right.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

jas@rtech.UUCP (Jim Shankland) (03/18/87)

Well, fresh mail made me want to dig at this again.

Guy Harris writes:
> I don't see any obvious reason why this *couldn't* happen on any
> UNIX system that didn't lock the file table entry while a write was
> in progress...  It may be that ... it's *less
> likely* to happen on a system using the V7 file system, but I don't
> see that it's impossible on such a system.

And Henry Spencer concurs:
> Not impossible at all.  Running the posted test program on utzoo (a vanilla
> V7), some NULs show up.  Guy's explanation sounds right.

But looking at my (probably ancient) System V source, it appears that
while the file table entry is indeed not locked, it is read only after
the inode has been locked (lines 62-70 of sys2.c, in rdwr(), in my old copy
of the source).  Therefore, the bug can, in fact, NOT happen on System
V, and appears to be limited to 4.xbsd and its commercial derivatives,
and also V7 systems (are there any left besides utzoo?:-)) (DON'T answer
that; on second thought, I'll bet there are, and I'm going to hear from
all of them.)
-- 
Jim Shankland
 ..!ihnp4!cpsc6a!\
                  rtech!jas
..!ucbvax!mtxinu!/