jsq@usenix.uucp (John Quarterman) (08/25/87)
From: jsq@usenix.uucp (John Quarterman)

Pipe Write Problems    Page 1 of 11    IEEE 1003.1 N.116
$Revision: 3.1 $  DRAFT  $Date: 87/08/24 10:54:56 $

John S. Quarterman
Institutional Representative From USENIX to IEEE P1003
{uunet,ucbvax,seismo}!usenix!jsq

Texas Internet Consulting
701 Brazos, Suite 500
Austin, Texas 78701-3243
+1-512-320-9031
jsq@longway.tic.com

24 August 1987

Attention: P1003 Working Group
Secretary, IEEE Standards Board
345 East 47th Street
New York, NY 10017

Cc: 1003.1 Technical Reviewers:

Maggie Lee, 2         Jeff Smits, 6         Hal Jespersen, Rationale
+1-408-746-7216       +1-201-522-6263       +1-415-420-6400
ihnp4!amdahl!maggie   ihnp4!attunix!smits   ucbvax!unisoft!hlj

There are several problems in IEEE Std 1003.1, Draft 11 regarding writes to a pipe or FIFO. These problems are sufficient to produce a no ballot from USENIX. This objection includes discussion of the problems, their sources, and suggested solutions, including both standard and rationale text.

1. Problems

1.1 Ambiguous O_NONBLOCK wording in Draft 11, 6.4.2.2.

Understanding the case of the triple condition

  + O_NONBLOCK is set,
  + and {PIPE_BUF} < nbyte <= {PIPE_MAX},
  + and 0 < immediately writable < nbyte,

requires a close reading of Draft 11, 6.4.2.2, page 125, lines 224-227:

     If the O_NONBLOCK flag is set, write() shall not block the process. If nbyte > {PIPE_BUF}, and some data can be written without blocking the process, write() shall write what it can and return the number of bytes written. Otherwise, it shall return -1 and errno shall be set to [EAGAIN].

It is not immediately obvious what ``Otherwise'' refers to (which clause of the condition?). But in the context of the paragraph at lines 217-221 it must refer to the case when {PIPE_BUF} < nbyte <= {PIPE_MAX} and no data can be written without blocking the process.

1.2 Nonblocking partial pipe writes are an option in Draft 11.

According to David Willcox, who was in many of the atomic pipe write small groups, the word ``can'' in both uses in the preceding quote is meant to refer to what the implementation permits. In other words, the case where ``some data can be written'' may refer to there being some space free in the pipe, or the case may be null, meaning that [EAGAIN] will always be returned when {PIPE_BUF} < nbyte <= {PIPE_MAX}, regardless of whether there is free space in the pipe or not.

Which is to say that the standard permits the implementation to perform partial writes, but does not require it to do so.

Partial writes are not implementation-defined (according to the definition in 2.1), because the standard completely describes their behavior (or attempts to). So partial writes are an interface implementation option in Draft 11, even though they are not properly specified as such by the use of the word ``may'' or listing in 2.2.1.2.

1.3 Incorrect error code?

If partial writes are not implemented, the error [EAGAIN] is not appropriate, because the write will never succeed, no matter how many times it is retried. Better would be [EINVAL], which matches the other cases where retrying will not help.

However, this argument assumes that {PIPE_BUF} is not only the maximum atomic size, but also the maximum amount writable on one operation: this may not be so; see below.

1.4 {PIPE_MAX} with O_NONBLOCK clear.

Should {PIPE_MAX} apply when O_NONBLOCK is not set? All of Version 7, System V Release 3, 4.2BSD, and 4.3BSD permit arbitrarily large values of nbyte when O_NDELAY is not set. While it is possible to imagine a system where such a limit would be required by the implementation, there seem to be none at the moment, so there are probably no applications that depend on it.
The enforcement of such a limit would make pipes basically different from other things that write() can be applied to, requiring extra code in applications. Thus there is no obvious advantage in portability for applications. So {PIPE_MAX} should not be applied when O_NONBLOCK is clear.

2. Sources of the problems.

There are three basic sources of confusion about the behavior of pipes and FIFOs (especially when the non-blocking flag is set):

  1. It is not clear what the various existing systems do.
  2. It is clear that they do many things differently.
  3. It is not clear what behavior is important to applications, and thus worth standardizing.

2.1 Existing systems.

Some of the following descriptions may not be totally accurate, but they should serve to illustrate the point of diversity.

+ Version 7 introduced atomicity of writes to pipes. The manual page write(2) guarantees that write requests of 4096 bytes or less will not be interleaved with writes from any other process. The purpose of this feature was to allow multiple processes to write to the same pipe while permitting a single reader to parse their data. 4096 also happens to be the size of a pipe, and is fixed at compile time (it is not larger because that would have made pipes large files, that is, they would have had indirect blocks). Any amount (that will fit in an int) of data may be requested on a single call to write(). Version 7 does not have a non-blocking flag.
+ The SVID requires atomicity of writes to pipes when the request is of {PIPE_BUF} bytes or less. This feature may have been introduced from the /usr/group Standard, which had it. There is no maximum write request, regardless of whether O_NDELAY is set. With O_NDELAY set, write requests of less than {PIPE_BUF} bytes either succeed or return zero. Write requests of more than that may also succeed partially, returning the amount written.
+ 4.2BSD appears to guarantee atomicity of pipe write requests up to 1024 bytes. It will return an error for requests for more than 4096 bytes when the O_NDELAY flag is set. Partial writes are not done. With the flag clear, any size write request will succeed eventually.
+ 4.3BSD does not guarantee atomicity of any size pipe write (greater than one byte). The maximum amount that can be requested will vary dynamically, as will the maximum amount that can be written on a single operation. With the O_NDELAY flag set, any write of more than one byte may be partial. UCB CSRG is probably amenable to changing this behavior.
+ Version 8 does not necessarily measure the maximum amount of data that can be written to a pipe on a given operation in bytes, i.e., it may depend on the number of outstanding write requests. There is no nonblocking flag in Version 8 or Version 9.

2.2 Useful behavior.

It is more useful to specify how an application should interpret a return value than it is to specify precisely when the implementation shall return it. I believe this observation may be the rope for climbing out of the chronic pipe write morass.

[EAGAIN] should mean that retrying later with the same size request may succeed. The Rationale should recommend actions the application should take in such a case. Because some systems dynamically vary their pipe size, what would have succeeded this time on an empty pipe may not succeed next time. Of course, if the request was for {PIPE_BUF} or less bytes, retries shall eventually succeed (unless no reader reads enough from the pipe). But it is not useful for the standard to attempt to specify for exactly what larger requests [EAGAIN] will be returned, or the probability of success on later retries. After all, if the reader does not read, no retries will succeed.
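
As an illustration of the interpretation of [EAGAIN] argued for above, here is a minimal C sketch of what an application might do with a deferred atomic write. It is not taken from any draft or existing system: the names set_nonblock() and write_retry_eagain(), the retry bound, and the use of poll() to wait between attempts are all assumptions made for the example. It handles only requests of {PIPE_BUF} bytes or less, and it gives up after a bounded number of attempts because no retry can succeed if the reader does not read.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK on fd. Returns 0, or -1 on error. */
int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: attempt an atomic nonblocking write of
 * nbyte <= PIPE_BUF bytes, retrying when the write is deferred.
 * Returns the value of a successful write(), or -1 with errno set.
 * After max_tries deferred attempts it gives up with errno == EAGAIN. */
ssize_t write_retry_eagain(int fd, const void *buf, size_t nbyte,
                           int max_tries, int wait_ms)
{
    assert(nbyte <= (size_t)PIPE_BUF);  /* only the atomic case is handled */

    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t ret = write(fd, buf, nbyte);
        if (ret >= 0)
            return ret;     /* complete: no partial result is expected here */
        if (errno != EAGAIN)
            return -1;      /* some other error: report it to the caller */

        /* Deferred: wait a bounded time for the pipe to report writable.
         * The result of poll() is ignored; the next write() decides. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
        (void)poll(&pfd, 1, wait_ms);
    }
    errno = EAGAIN;         /* still deferred after every attempt */
    return -1;
}
```

A caller would set the flag once with set_nonblock(fd) and then call, for example, write_retry_eagain(fd, msg, len, 5, 100), treating a final -1 with [EAGAIN] as "the reader is not keeping up" rather than as a permanent failure.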

[EINVAL] should mean that retrying later with the same size request shall never succeed. But the standard should not require the implementation to always return this error at a fixed limit.

There is no reason for the standard to try to specify what happens in every corner case produced by the intersections of all the known implementations. The standard should specify behavior that promotes portability of applications and that is implementable relatively readily on existing systems.

In addition, the behavior of writes to pipes or FIFOs should be made as little different from that of writes to other file descriptors as possible. The main reason for making it different at all is that POSIX does not currently include any more sophisticated interprocess communication facility: for example, given a reliable sequenced datagram service, there would be no need to require pipes to be atomic.

  1. Atomic writes are useful. The standard should specify that write requests of {PIPE_BUF} or less bytes shall be atomic, regardless of whether O_NONBLOCK is set.
  2. Write requests of more than {PIPE_BUF} bytes with O_NONBLOCK set are useful. A real time data acquisition process might want to write large amounts of data through a pipe to a single processing process, while never blocking.
  3. Partial writes are useful, but not useful enough for the standard to require the implementation to include them. The standard should, however, require portable applications to expect them, since applications should expect them for other kinds of writes anyway. In other words, partial writes should not be a major option, instead merely an implementation-defined detail. Exactly when they occur is not important enough to specify (especially considering that it is not specified for other kinds of writes), except that they are prohibited when nbyte <= {PIPE_BUF} because of the guarantee of atomicity.
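
To make concrete what it means for a portable application to expect partial writes, here is a hedged C sketch of one pass of nonblocking output for a large buffer. The function name write_some() and the policy of halving the request after [EINVAL] are hypothetical choices for illustration; they are not part of Draft 11 or of any existing implementation. The sketch assumes the caller has already set O_NONBLOCK with fcntl(). Note that data written in pieces this way is not guaranteed to be atomic.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* O_NONBLOCK, fcntl(): the caller sets the flag */
#include <limits.h>
#include <unistd.h>

/* Hypothetical helper: write as much of buf as possible without blocking.
 * Returns the number of bytes accepted (possibly less than nbyte), or -1
 * with errno set if nothing at all could be written. */
ssize_t write_some(int fd, const char *buf, size_t nbyte)
{
    size_t done = 0;
    size_t limit = nbyte;       /* largest single request we will try */

    assert(buf != NULL || nbyte == 0);

    while (done < nbyte) {
        size_t request = nbyte - done;
        if (request > limit)
            request = limit;

        ssize_t ret = write(fd, buf + done, request);
        if (ret > 0) {
            done += (size_t)ret;    /* complete or partial: advance */
            continue;
        }
        if (ret == 0) {
            errno = EAGAIN;         /* historical O_NDELAY style: deferred */
            break;
        }
        if (errno == EINVAL && request > (size_t)PIPE_BUF) {
            /* Invalid: this size shall never succeed, so try a smaller
             * request, but never shrink below PIPE_BUF. */
            limit = request / 2 > (size_t)PIPE_BUF ? request / 2
                                                   : (size_t)PIPE_BUF;
            continue;
        }
        break;                      /* EAGAIN (deferred) or another error */
    }

    if (done > 0 || nbyte == 0)
        return (ssize_t)done;
    return -1;                      /* errno is from the last write() */
}
```

The caller keeps its own offset into the data and calls write_some() again later for the remainder; a return of -1 with [EAGAIN] means only that nothing could be written on this pass.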

There is no strong reason for an application to be able to discover at compile or run time whether partial writes are implemented: every application should assume that they may be implemented.

The usefulness of {PIPE_MAX} is slightly dubious, and it might be better to eliminate it, instead specifying that [EINVAL] may be returned whenever O_NONBLOCK is set and nbyte > {PIPE_BUF}. But let us assume that it is useful.

  1. A maximum amount that can be requested without ever producing [EINVAL] is worthwhile. {PIPE_MAX} could be used for this. But it should not apply if O_NONBLOCK is not set.
  2. {PIPE_MAX} >= {PIPE_BUF}. Allowing {PIPE_MAX} < {PIPE_BUF} would permit a guaranteed atomic write to return [EINVAL], which is a contradiction.
  3. The standard should explicitly permit an implementation to set {PIPE_MAX} = {PIPE_BUF}, simply because there is no reason to prohibit it. This would not rule out partial writes, but would mean that applications running on such an implementation should never depend on successful writes with nbyte > {PIPE_BUF}.
  4. The standard should permit an implementation to set {PIPE_MAX} = {INT_MAX}, meaning that [EINVAL] will never be returned. That is effectively what some implementations do, and there is no reason not to if partial writes are implemented.
  5. An implementation could even set all three limits equal: {PIPE_BUF} = {PIPE_MAX} = {INT_MAX}, meaning that [EINVAL] will never be returned, there are no partial writes, and all writes are atomic.

Finally, this is an interface standard: it should not try to specify implementation details, such as the internal buffering arrangements of the pipe. Such phrases as ``it shall write as much as it can'' are inappropriate.

3. Rewording.

Here is rewording to account for the implications of the above arguments. The text and tables below include specifications and rationale for {PIPE_MAX}.
But, if the Working Group decides to drop {PIPE_MAX}, it can be excised with no ill effects. References to it should then also be removed from Draft 11 2.9.2, page 42, lines 808-810, and 5.7.1.2, page 117, line 971.

3.1 Standard.

Move the definition of {PIPE_MAX} down into the text that specifies what happens when O_NONBLOCK is set. That is, first remove Draft 11 6.4.2.2, page 125, lines 215-216:

     Write requests for greater than {PIPE_MAX} bytes shall result in a return of value of -1 and set errno to [EINVAL].

Then replace the wording (quoted in 1.1 above) of Draft 11, 6.4.2.2, page 125, lines 224-227 with this new wording:

     If the O_NONBLOCK flag is set, write requests shall be handled differently in the following ways:

          The write() function shall not block the process.

          Write requests for {PIPE_BUF} or less bytes shall either succeed completely and return nbyte, or return -1 and set errno to [EAGAIN] to indicate that retrying the write() later with the same arguments may succeed.

          Write requests for more than {PIPE_BUF} bytes may in addition write some amount of data less than nbyte and return the amount written.

          Write requests for more than {PIPE_MAX} bytes may in addition return -1 and set errno to [EINVAL] to indicate that retrying the write() later with the same arguments shall never succeed. {PIPE_MAX} shall be greater than or equal to {PIPE_BUF} and less than or equal to {INT_MAX}.

The beginning of the following paragraph, 6.4.2.2, page 125, lines 228-229, is misleading and should be changed from

     When attempting to write to a file descriptor...

to

     When attempting to write to a file descriptor (other than one for a pipe or FIFO)...

The meaning of [EINVAL] when set by write() as specified in 6.4.2.4, page 126, lines 260-261, should be changed from

     [EINVAL]  An attempt was made to write more than {PIPE_MAX} bytes to a pipe or FIFO special file.

to

     [EINVAL]  An attempt was made to write to a pipe or FIFO special file with a value of nbyte greater than {PIPE_MAX} and also large enough that the operation shall never succeed if retried.

3.2 Rationale.

In the Rationale, remove the editorial note from B.6.4.2, Page 240, line 2104, and replace B.6.4.2, Page 240, line 2105 (``Write to a Pipe'') with:

[begin replacement]

An attempt to write to a pipe or FIFO has several major characteristics:

Atomic/non-atomic
     A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. We call this maximum {PIPE_BUF}. The standard does not say whether write requests for more than {PIPE_BUF} bytes will be atomic, but requires that writes of {PIPE_BUF} or less bytes shall be atomic.

Blocking/immediate
     Blocking is only possible with O_NONBLOCK clear. If there is enough space for all the data requested to be written immediately, the implementation should do so. Otherwise, the process may block, that is, pause until enough space is available for writing. The effective size of a pipe or FIFO (the maximum amount that can be written in one operation without blocking) may vary dynamically, depending on the implementation, so it is not possible to specify a fixed value for it.

Complete/partial/deferred
     A write request,

          int fildes, nbyte, ret;
          char *buf;

          ret = write(fildes, buf, nbyte);

     may return

     complete: ret = nbyte

     partial: ret < nbyte
          This shall never happen if nbyte <= {PIPE_BUF}.
          If it does happen (with nbyte > {PIPE_BUF}), the standard does not guarantee atomicity, even if ret <= {PIPE_BUF}, because atomicity is guaranteed according to the amount requested, not the amount written.

     deferred: ret = -1, errno = [EAGAIN]
          This error indicates that a later request may succeed. It does not indicate that it shall succeed, even if nbyte <= {PIPE_BUF}, because if no process reads from the pipe or FIFO, the write will never succeed. An application could usefully count the number of times [EAGAIN] is caused by a particular value of nbyte > {PIPE_BUF} and perhaps do later writes with a smaller value, on the assumption that the effective size of the pipe may have decreased.

     Partial and deferred writes are only possible with O_NONBLOCK set.

Requestable/invalid
     If a write request shall never succeed with the value given for nbyte, the request is invalid, and write() shall return -1 with errno set to [EINVAL]. This is only permitted to happen when nbyte > {PIPE_MAX} and O_NONBLOCK is set, and it is never required to happen. {PIPE_MAX} is not necessarily a minimum on the effective size of a pipe or FIFO; if it says anything about that size, it is that it sometimes varies above {PIPE_MAX}.

     Because {PIPE_MAX} specifies the maximum size write request that shall never cause [EINVAL], it must be greater than or equal to the maximum atomic write size, {PIPE_BUF}. {PIPE_BUF} and {PIPE_MAX} may be equal, which means that [EINVAL] may be produced by any write of greater than {PIPE_BUF} bytes. {PIPE_MAX} may be equal to {INT_MAX}, meaning that [EINVAL] shall never be returned (unless nbyte > {INT_MAX}, when the result is implementation-defined). All three limits may be equal, meaning that [EINVAL] shall never be returned, no partial writes are done, and all completed writes are atomic. Applications should be prepared for all these cases.

The relations of these properties are best shown in tables.

 ________________________________________________
| Write to a Pipe or FIFO with O_NONBLOCK clear. |
|_____________|__________________________________|
| immediately |                                  |
| writable:   | none       some       nbyte      |
|_____________|__________________________________|
|             | atomic     atomic     atomic     |
| nbyte <=    | blocking   blocking   immediate  |
| {PIPE_BUF}  | nbyte      nbyte      nbyte      |
|_____________|__________________________________|
|             | atomic?    atomic?    atomic?    |
| nbyte >     | blocking   blocking   immediate  |
| {PIPE_BUF}  | nbyte      nbyte      nbyte      |
|_____________|__________________________________|

If the O_NONBLOCK flag is clear, a write request shall block if the amount writable immediately is less than that requested. If the flag is set (by fcntl()), a write request shall never block.

 ____________________________________________________________
| Write to a Pipe or FIFO with O_NONBLOCK set.               |
|_____________|______________________________________________|
| immediately |                                              |
| writable:   | none           some           nbyte          |
|_____________|______________________________________________|
| nbyte <=    | -1,            -1,            atomic         |
| {PIPE_BUF}  | [EAGAIN]       [EAGAIN]       nbyte          |
|_____________|______________________________________________|
|             |                atomic?        atomic?        |
|             |                < nbyte        <= nbyte       |
| nbyte >     | -1,            or -1,         or -1,         |
| {PIPE_BUF}  | [EAGAIN]       [EAGAIN]       [EAGAIN]       |
|_____________|______________________________________________|
|             |                atomic?        atomic?        |
|             |                < nbyte        <= nbyte       |
| nbyte >     | -1,            or -1,         or -1,         |
| {PIPE_MAX}  | ([EAGAIN]      ([EAGAIN]      ([EAGAIN]      |
|             | or [EINVAL])   or [EINVAL])   or [EINVAL])   |
|_____________|______________________________________________|

There is no way provided for an application to determine whether the implementation will ever perform partial writes to a pipe or FIFO.
Every application should be prepared to handle partial writes when O_NONBLOCK is set and the requested amount is greater than {PIPE_BUF}, just as every application should be prepared to handle partial writes on other kinds of file descriptors.

Where the standard requires -1 returned and errno set to [EAGAIN], most historical implementations return 0 (with the O_NDELAY flag set: that flag is the historical predecessor of O_NONBLOCK, but is not itself in the standard). The error indications in the standard were chosen so that an application can distinguish these cases from end of file. While write() cannot receive an indication of end of file, read() can, and the Working Group chose to make the two functions have similar return values. Also, some existing systems (e.g., Version 8) permit a write of zero bytes to mean that the reader should get an end of file indication: for those systems, a return value of zero from write indicates a successful write of an end of file indication.

[end replacement]

CONTENTS

1. Problems
   1.1 Ambiguous O_NONBLOCK wording in Draft 11, 6.4.2.2
   1.2 Nonblocking partial pipe writes are an option in Draft 11
   1.3 Incorrect error code?
   1.4 {PIPE_MAX} with O_NONBLOCK clear
2. Sources of the problems
   2.1 Existing systems
   2.2 Useful behavior
3. Rewording
   3.1 Standard
   3.2 Rationale

Volume-Number: Volume 12, Number 22