jas@rtech.UUCP (03/04/87)
Has anyone ever encountered this one before?

The small program below forks, and then both parent and child write
the integers from 0 to 19999, one to a line, to stdout.  The parent
prefixes each integer with 'P', the child with 'C'.  A separate
write(2) call is used for each line, so stdio buffering doesn't
figure in here.

What we should all expect to see in the output file (remember, parent
and child share a seek pointer) is some interleaving of the parent's
output lines and the child's.  But on many 4.2-derived systems (see
below), about 20 of the 40,000 total write(2) calls result in nulls
being written to the file, instead of the data pointed at in the
write(2) call.

Repeat-by:
	Run the program below, directing stdout into a file.
	Check the output file for nulls.

This bug appears to exist only on 4.2-derived systems.  So far, I have
found it on the following machines:

	Microvax running Ultrix
	CCI Power-6 running CCI's 4.2bsd port
	CCI Power-6 running CCI's System V (internally, it's still
		derived from 4.2)
	Sequent Balance/whatever
	Pyramid 90x
	Sun 3/whatever

The bug is NOT on:

	AT&T 3B15 and 3B20 running you-know-what

I think I'm also running into a variant of this problem involving
spurious nulls being written to a pipe when a signal occurs at just
the wrong time, and another pipe write is done in the signal handler.
I haven't been able to duplicate that one (yet) in a simple test case,
though.

Any of you kernel types care to dig into this one?  Otherwise, I'll
have to, sooner or later.

------------------------ BEGIN TEST PROGRAM -------------------------------
/*
** Do concurrent write(2) calls to the same file; on lots of
** 4.2-derived systems, bad data shows up on the file.
*/
main()
{
	register int	seqno = 0;
	register int	len;
	register int	pid;
	char		buf[32];

	if ( ( pid = fork() ) < 0 )
	{
		perror( "fork" );
		exit( 1 );
	}

	for ( seqno = 0; seqno != 20000; ++seqno )
	{
		sprintf( buf, "%c%d\n", pid ? 'P' : 'C', seqno );
		len = strlen( buf );
		if ( write( 1, buf, len ) != len )
			perror( "write" );
	}

	exit( 0 );
}
------------------------ END TEST PROGRAM -------------------------
-- 
Jim Shankland
 ..!ihnp4!cpsc6a!\
                  rtech!jas
..!ucbvax!mtxinu!/
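The "check the output file for nulls" step can be done with any tool that shows NUL bytes; a minimal sketch in C follows. The helper name count_nuls is hypothetical (it is not part of the original post), and it assumes only standard stdio.

```c
#include <stdio.h>

/*
 * Count the NUL bytes in the named file.
 * Returns the count, or -1 if the file cannot be opened.
 * count_nuls is a hypothetical helper name, not from the original post.
 */
long count_nuls(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long nuls = 0;
    int c;

    if (fp == NULL)
        return -1;
    while ((c = getc(fp)) != EOF)   /* read one byte at a time */
        if (c == '\0')
            ++nuls;
    fclose(fp);
    return nuls;
}
```

Calling count_nuls on the file that received the test program's stdout should give 0 on a system without the problem, and a nonzero count when holes were left behind.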
guy%gorodish@Sun.COM (Guy Harris) (03/06/87)
>This bug appears to exist only on 4.2-derived systems.

Well, I don't know about that.  You see, it's like this:

Process A does a "write" call.  It grabs the current value of the
file pointer and uses it as the write offset.  It then locks the
inode and goes in to write stuff.

The write requires a new block to be allocated.  This may require I/O
to be done; assume it does.  The process blocks waiting for the I/O
to complete, and process B gets scheduled.

Process B also does a "write" call.  Since process A's "write" hasn't
finished, the file pointer has NOT been updated.  B grabs the same
offset value that process A got.  It can't write yet, though, because
the inode is locked.  So it waits.

Process A now finishes its I/O and finishes the "write".  It unlocks
the inode and updates the file pointer by adding the number of bytes
it wrote.

Now assume that process A gives up the processor as soon as it
returns from the kernel, and process B gets the processor.  It now
proceeds to write *its* data *on top of* the data that process A
wrote.  It unlocks the inode, and returns, adding the number of bytes
*it* wrote to the file pointer.

Thus, the file pointer moves by the sum of the number of bytes
processes A and B wrote.  However, only the maximum of the two byte
counts was actually written to the file.  The file pointer now points
some number of bytes *past* the last byte written; the next "write"
will write at that location, leaving behind a hole filled with - you
got it - zeroes.

This is borne out by 1) the fact that in a test case I ran (the test
program was modified so that the parent counted *down* rather than
*up*, so that the parent and child would be more likely to be writing
different numbers of bytes), it clearly looked like the two processes
both tried to write a record to the *same* location in the file - a
location that started on a 512-byte boundary - and that the zeroes
followed this scrambled record and 2) the fact that when I changed
the program to put the file descriptor in forced-append mode (so that
the writes *never* overlap) the problem went away.

I don't see any obvious reason why this *couldn't* happen on any UNIX
system that didn't lock the file table entry while a write was in
progress, and no system I've worked with does so.  It may be that due
to the vagaries of the scheduler, and the amount of I/O done when
extending a file in small chunks, and things like that, it's *less
likely* to happen on a system using the V7 file system, but I don't
see that it's impossible on such a system.

In short, the problem is that UNIX has never been able to guarantee
that the file pointer is always valid; it's invalid while an I/O
operation is "in progress", but nothing prevents a process from using
the file pointer's value while it isn't valid.  The solution is
something like "use file locking" or "use forced append mode" or "use
something else that will keep a process from using the file pointer
value while a 'write' is in progress," assuming you can arrange that.

>I think I'm also running into a variant of this problem involving
>spurious nulls being written to a pipe when a signal occurs at just
>the wrong time, and another pipe write is done in the signal handler.

Not likely in 4.2BSD, since pipes don't go through the file system,
but go through the socket code.
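
The interleaving Guy describes can be replayed deterministically in user space. What follows is only a toy model under stated assumptions, not kernel code: struct model_file, begin_write, finish_write and replay_race are all hypothetical names, the "file" is a small zero-filled array, and the two strings must fit in FILE_MAX bytes together.

```c
#include <string.h>

#define FILE_MAX 64   /* toy limit; strlen(a) + strlen(b) must not exceed it */

/* Toy model of one open file: its contents plus the shared file pointer. */
struct model_file {
    char   data[FILE_MAX];  /* zero-filled, like an unwritten hole */
    size_t offset;          /* shared file pointer (file table entry) */
};

/* Step 1 of a write: sample the shared pointer BEFORE the inode lock. */
static size_t begin_write(const struct model_file *f)
{
    return f->offset;
}

/* Step 2: copy the data to the sampled offset, then advance the
 * shared pointer by the number of bytes written. */
static void finish_write(struct model_file *f, size_t at, const char *s)
{
    size_t n = strlen(s);

    memcpy(f->data + at, s, n);
    f->offset += n;
}

/* Replay the bad schedule: A and B both sample the offset before either
 * write completes.  Returns the number of NUL bytes left below the
 * final file pointer. */
size_t replay_race(struct model_file *f, const char *a, const char *b)
{
    size_t at_a, at_b, i, holes = 0;

    memset(f, 0, sizeof *f);
    at_a = begin_write(f);      /* A samples offset 0, then blocks on I/O */
    at_b = begin_write(f);      /* B samples the same offset 0 */
    finish_write(f, at_a, a);   /* A completes: pointer += len(a) */
    finish_write(f, at_b, b);   /* B lands on top of A: pointer += len(b) */
    for (i = 0; i < f->offset; i++)
        if (f->data[i] == '\0')
            holes++;
    return holes;
}
```

With a 7-byte record from A and a 3-byte record from B, the pointer in this model ends at 7 + 3 = 10 while only max(7, 3) = 7 bytes hold data, so 3 bytes below the pointer are still zero. In the real file those bytes become visible as nulls once the next write lands at the advanced pointer.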
edler@cmcl2.UUCP (Jan Edler) (03/10/87)
Our version of UNIX has locked the file table entry on every write
for several years now.  The only time it doesn't is when writing to a
"slow" device, like a terminal (and such devices don't usually
maintain the notion of "file position" anyway).  Nulls do not appear
in the output file when running the posted test program.

I don't see any really good reason for not handling this case
correctly; there is some overhead in getting the extra lock, but that
doesn't seem like a good enough reason to me.

Jan Edler
New York University, Ultracomputer Project
edler@nyu
cmcl2!edler
hutch@sdcsvax.UUCP (03/11/87)
<>

I presume O_APPEND on open?  If not, then life will get ugly.
O_APPEND will make things work *even* for multiple writers to the
same file, or at least it has worked for me so far.

-- 
	Jim Hutchison   UUCP:	{dcdwest,ucbvax}!sdcsvax!hutch
			ARPA:	Hutch@sdcsvax.ucsd.edu
2049 6d61 7320 6c65 2066 6572 7270 7365
6e65 6974 676e 6920 206e 6874 7369 6120
7472 6369 656c 202c 2049 6572 7270 7365
6e65 2074 6e6f 796c 6d20 7379 6c65 2e66
guy@gorodish.UUCP (03/12/87)
>I presume O_APPEND on open?
Or when the program starts; turning on O_APPEND with "fcntl" does the
same job, and is the only way to do it if you're writing to the
standard output (since you aren't the one who opened it). Yes,
that's what I meant by "forced-append mode".
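
A minimal sketch of turning on forced-append mode with "fcntl" on a descriptor you did not open yourself. It assumes the POSIX spelling of the interface (F_GETFL, F_SETFL, O_APPEND); systems of the period may have spelled the flag differently. The helper name force_append is hypothetical.

```c
#include <fcntl.h>

/*
 * Turn on O_APPEND for an already-open descriptor, keeping its other
 * status flags.  Returns 0 on success, -1 on error (errno is set by
 * fcntl).  force_append is a hypothetical helper name.
 */
int force_append(int fd)
{
    int flags = fcntl(fd, F_GETFL);     /* read current status flags */

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_APPEND) == -1)
        return -1;
    return 0;
}
```

In the posted test program this would be called as force_append(1) before the fork, so that parent and child inherit the same append-mode descriptor and every write goes to the current end of file.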
henry@utzoo.UUCP (Henry Spencer) (03/16/87)
> I don't see any obvious reason why this *couldn't* happen on any
> UNIX system that didn't lock the file table entry while a write was
> in progress...  It may be that ... it's *less
> likely* to happen on a system using the V7 file system, but I don't
> see that it's impossible on such a system.

Not impossible at all.  Running the posted test program on utzoo (a
vanilla V7), some NULs show up.  Guy's explanation sounds right.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry
jas@rtech.UUCP (Jim Shankland) (03/18/87)
Well, fresh mail made me want to dig at this again.  Guy Harris writes:

> I don't see any obvious reason why this *couldn't* happen on any
> UNIX system that didn't lock the file table entry while a write was
> in progress...  It may be that ... it's *less
> likely* to happen on a system using the V7 file system, but I don't
> see that it's impossible on such a system.

And Henry Spencer concurs:

> Not impossible at all.  Running the posted test program on utzoo (a vanilla
> V7), some NULs show up.  Guy's explanation sounds right.

But looking at my (probably ancient) System V source, it appears that
while the file table entry is indeed not locked, it is only read once
the inode is locked (lines 62-70 of sys2.c, in rdwr(), in my old copy
of the source).  Therefore, the bug can, in fact, NOT happen on
System V, and appears to be limited to 4.xbsd and its commercial
derivatives, and also V7 systems (are there any left besides
utzoo?:-))  (DON'T answer that; on second thought, I'll bet there
are, and I'm going to hear from all of them.)
-- 
Jim Shankland
 ..!ihnp4!cpsc6a!\
                  rtech!jas
..!ucbvax!mtxinu!/
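
The ordering described above (take the inode lock first, read the file pointer only afterwards) can be sketched as a toy user-space model. This is not the System V source; struct sysv_file, locked_write and replay_locked are hypothetical names, and the lock is a plain flag because the replay is single-threaded.

```c
#include <string.h>

#define FILE_MAX 64   /* toy limit; strlen(a) + strlen(b) must not exceed it */

/* Toy model: file contents, shared file pointer, and an inode lock flag. */
struct sysv_file {
    char   data[FILE_MAX];
    size_t offset;
    int    inode_locked;
};

/* One whole write in the order described: lock the inode, THEN read
 * the file pointer.  Returns 0, or -1 if the inode is already locked
 * (a real kernel would sleep instead of failing). */
static int locked_write(struct sysv_file *f, const char *s)
{
    size_t n = strlen(s), at;

    if (f->inode_locked)
        return -1;
    f->inode_locked = 1;
    at = f->offset;             /* sampled only while holding the lock */
    memcpy(f->data + at, s, n);
    f->offset = at + n;
    f->inode_locked = 0;
    return 0;
}

/* Two writers, each sampling the pointer under the lock.
 * Returns the number of NUL bytes below the final file pointer. */
size_t replay_locked(struct sysv_file *f, const char *a, const char *b)
{
    size_t i, holes = 0;

    memset(f, 0, sizeof *f);
    locked_write(f, a);
    locked_write(f, b);         /* starts where a ended, no overlap */
    for (i = 0; i < f->offset; i++)
        if (f->data[i] == '\0')
            holes++;
    return holes;
}
```

Because no writer can observe the pointer while another write is in flight, the second record in this model always starts where the first one ended, and no zero bytes are left below the pointer.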