emv@msen.com (Ed Vielmetti) (06/22/91)
Content-type: text-plus/richtext

What follows is a draft of a proposal to create a format for multimedia RFC-822 mail (and, by extension, multimedia Usenet News, since they follow the same model). It is notable for containing a specification for a minimal <bold>SGML</bold>-compatible format for "richtext", a text markup intended to be easy to parse and rich enough to be useful for real applications.

Comments on the draft can go to the addresses below or to this group.

I've tried to type this message as best I can in the style of the draft, so that if you have a conformant implementation you'll be able to click on the box below and get a copy of the PostScript version.

<x-signature>
Edward Vielmetti, MSEN Inc.
moderator, comp.archives
emv@msen.com
<x-snappy-signature-quote>
On the Net, the Net-way is best. It's just that we are trying to figure out what the Net-way is. e. miya
</x-snappy-signature-quote>
</x-signature>

--richmail-internet-draft
Content-Type: application/external-reference; name=/pub/nsb/BodyFormats.ps; real-type=text-plus/postscript; site=thumper.bellcore.com; expiration="23 Sep 1991 12:00:00 -0400"

--richmail-internet-draft

INTERNET DRAFT

Mechanisms for Specifying and Describing the Format of Internet Message Bodies

Nathaniel Borenstein, Bellcore
Ned Freed, Innosoft

June 1991

Status of This Memo

This draft document will be submitted to the RFC editor as a protocol specification. Distribution of this memo is unlimited. Please send comments to Nathaniel Borenstein <nsb@thumper.bellcore.com>. Experimentation with the mechanisms described in this memo is encouraged. It is anticipated that such experimentation will take place during the summer of 1991, after which a new draft will be submitted to the RFC editor. Comments that are intended to affect that future draft should be received no later than September 23, 1991.
Abstract

This document suggests extensions to the RFC 822 message representation protocol to allow multi-part textual and non-textual messages to be represented and exchanged without loss of information. This is based on earlier work documented in RFC 934 and RFC 1049, but extends and revises that work. In particular, it is designed to permit and standardize Internet mail mechanisms for representing text in character sets other than US-ASCII, for including formatted multi-font text messages, for including non-textual material such as images and audio fragments, and for generally extending Internet mail to include new types of objects that are tagged in such a way that cooperating mail agents can recognize their types.

Contents

1 Introduction
2 The Content-Type Header Field
3 The Content-Transfer-Encoding Header Field
   3.1 Quoted-Printable Content-Transfer-Encoding
   3.2 Base64 Content-Transfer-Encoding
4 Additional Optional Content- Header Fields
   4.1 Optional Content-ID Header Field
   4.2 Optional Content-Description Header Field
5 The Predefined Content-type Values
   5.1 The TEXT Content-type and the US-ASCII Character Set
   5.2 The "Multipart" Content-Type
   5.3 The "Text-Plus" Content-Type and "Richtext" subtype
   5.4 The Message Content-Type
   5.5 The Binary Content-Type
   5.6 The Application Content-Type Value
   5.7 The Audio, Image, and Video Content-Type Values
   5.8 Experimental ("X-") Content-Type Values
6 Conformance With this Memo
Appendix I -- Guidelines For Sending Data Via Email
Appendix II -- A Complex Multipart Example
Appendix III -- The US-ASCII Character Set
Summary
Contacts
Acknowledgements
References

1 Introduction

Since its publication in 1982, RFC 822 [RFC-822] has defined the standard format of textual mail messages on the Internet.
Its success has been such that the RFC 822 format has been adopted, wholly or partially, well beyond the confines of the Internet and of SMTP transport, as defined by RFC 821 [RFC-821]. As the format has seen wider use, a number of limitations have become increasingly problematic for the user community. RFC 822 was intended to specify a format for text messages. As such, non-text messages, such as multimedia messages that might include audio or images, are simply not mentioned. Even in the case of text, however, RFC 822 is inadequate for the needs of email users whose languages require the use of character sets richer than US ASCII [REF-ANSI]. For mail containing audio, video, Asian language text, or even text in most European languages, RFC 822 does not specify enough to permit interoperability. One of the notable limitations of RFC 821/822 based mail systems is the fact that they limit the contents of electronic mail messages to relatively short lines of seven-bit ASCII. This forces a user to convert any non- textual data that she may wish to send into seven-bit bytes representable as printable ASCII characters before invoking her local mail UA (User Agent program). Examples of such encodings currently used in the Internet include pure hexadecimal, uuencode, the 3-in-4 base 64 scheme specified in RFC 1113, the Andrew Toolkit Representation, and many others. These limitations become even more apparent as gateways are designed to allow for the exchange of mail messages between RFC 822 hosts and X.400 hosts. X.400 [REF-X400] specifies mechanisms for the inclusion of non-textual body parts within electronic mail messages. The current standards for the mapping of X.400 messages to RFC 822 messages specify that either X.400 non-textual body parts should be converted to (not encoded in) an ASCII format, or that they should be discarded, notifying the RFC 822 user that discarding has occurred. 
This is clearly undesirable, as information that a user may wish to receive is lost. Even though a user's UA may not have the capability of dealing with the non-textual body part, the user might have some mechanism external to the UA that can extract useful information from the body part. Moreover, it does not allow for the fact that the message may eventually be gatewayed back into an X.400 MHS, where the non-textual information would definitely become useful again.

This memo describes several mechanisms that combine to solve these problems. In particular, it describes:

1. A Content-type header field, generalized from RFC 1049 [RFC-1049], which can be used to describe the type and subtype of data in the body of a message and to fully specify the representation (encoding) of such data.

2. A Content-Transfer-Encoding header field, which can be used to describe an auxiliary encoding that was applied to the data in order to allow it to pass through the mail transport layer.

3. A "text" content-type value, which can be used to represent text information in a number of character sets in a standardized manner.

4. A "multipart" content-type value, which can be used to combine several separate body-parts, which may be made of different types of data, into a single message.

5. A "binary" content-type value, which can be used to transmit uninterpreted or partially-interpreted binary data, and hence to implement an email file transfer service.

6. A "message" content-type value, for encapsulating a mail message.

7. Several additional content-type values and subtypes, which can be used by consenting User Agents to interoperate with additional message types such as audio, images, and more.

8. Several optional header fields that can be used to further describe the data in a message body or body-part, in particular the Content-ID and Content-Description header fields.
Finally, to specify and promote a minimal level of interoperability, this memo describes a subset of the above mechanisms that defines "conformance" with this memo. That is, it specifies the minimal subset required for an implementation to be called "XXXX-conformant."

2 The Content-Type Header Field

The Content-Type header field was first defined in RFC 1049. This section extends and supersedes that definition. RFC 1049 content-types are all conformant with the new, more general syntax. (In particular, RFC 1049 content-types omitted the subtype/character-set specification, and always had at most two of the parts now called "parameters", which were distinguished by their position as indicating a version number and a resource reference.)

The purpose of the content-type field is to describe the data contained in the message body fully enough that the receiving user agent can pick an appropriate agent or mechanism to present the data to the user, or to otherwise deal with the data in an appropriate manner.

The Content-Type header field is used to specify the type of data in a message, by giving a type name, and to provide auxiliary information that may be required for certain types. In addition, a distinguished syntax is defined for specifying subtype information, including character set information in the case of text. After the type name and the optional subtype, the remainder of the header field is simply a set of parameter specifications, as defined for each named type, and an optional comment.
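As a rough illustration of that structure (type name, optional subtype, parameters, optional comment), a receiving agent might split a field value along the following lines. This is only a sketch in Python, not part of the draft: the function names are hypothetical, type and subtype are folded to lower case on the assumption that they compare case-insensitively, only a single un-nested trailing comment is recognized, and backslash-escaped quotes inside quoted strings are not handled.

```python
import re


def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators inside double-quoted strings."""
    parts, buf, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


def parse_content_type(value):
    """Return (type, subtype or None, list of raw parameter strings)."""
    # Drop one trailing parenthesised comment, if present (no nesting).
    value = re.sub(r"\s*\([^()]*\)\s*$", "", value)
    head, *params = split_outside_quotes(value)
    type_name, _, subtype = head.strip().partition("/")
    return (
        type_name.strip().lower(),
        subtype.strip().lower() or None,
        [p.strip() for p in params if p.strip()],
    )


if __name__ == "__main__":
    print(parse_content_type("Text-Plus/Richtext"))
    print(parse_content_type(
        "application/external-reference; name=/pub/nsb/BodyFormats.ps; "
        "site=thumper.bellcore.com"))
```

Parameters are returned as raw strings because their meaning is defined per type; a real user agent would interpret them only after dispatching on the type name.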
In the Extended BNF notation of RFC-822, we define a Content-type header field value as follows:

   Content-Type := type ["/" subtype] *[";" parameter] [comment]

   type      := "TEXT" / "TEXT-PLUS" / "MESSAGE" / "AUDIO" / "IMAGE" / "VIDEO" / "BINARY" / "APPLICATION" / "MULTIPART" / "X-" token

   subtype   := token

   parameter := token / quoted-string

   token     := 1*<any CHAR except SPACE, CTLs, and tspecials>

   tspecials := "(" / ")" / "<" / ">" / "@" / "," / ":" / "/" / "\" / <"> / "[" / "]" / ";"

The type and subtype values are not case sensitive. TEXT, Text, and TeXt are all equivalent.

An initial set of nine content-types is defined by this memo. This set of top-level names is intended to be substantially complete. It is expected that additions to the larger set of supported types can usually be accomplished by the creation of new subtypes of these initial types. In the future, more top-level types should be defined by an extension to this standard.

The only constraint on the definition of subtype names is the desire that their uses not conflict. That is, it would be undesirable to have two different communities using "Content-type: binary/foobar" to mean two different things. The process of defining new content-subtypes, then, is not intended to be a mechanism for imposing restrictions, but simply a mechanism for publicizing the usages. There are, therefore, two acceptable mechanisms for defining new content-type subtypes:

1. Private values (starting with "X-") may be defined bilaterally between two cooperating agents without outside approval or standardization.

2. "Standard" values may be defined by the publication of an Internet RFC, or by registering them with the Internet Assigned Numbers Authority (IANA) at ISI, by email to IANA@ISI.EDU.

The nine standard initial predefined content-types are detailed in the appendices of this memo.
They are:

   text -- textual information, with character set given by the subtype

   text-plus -- mostly textual information, with embedded formatting commands. A simple default type is defined, with possible subtypes including troff, TeX, and so on.

   message -- an encapsulated message, with initial subtypes for partial messages and privacy-enhanced messages

   multipart -- a message consisting of multiple parts of independent type values, with initial subtype digest.

   audio -- a message containing audio data, with initial subtypes a-law and u-law.

   image -- a message containing image data, with initial subtypes G3fax, gif, pbm, ppm, and pgm.

   video -- a message containing video data.

   binary -- a message containing some other form of binary data.

   application -- a message containing data to be processed by a mail-based application.

If no Content-type header field is present, "text" is generally to be assumed, with the default (US-ASCII) subtype as specified later in this memo. This is consistent with the default message body type as defined by RFC 822. However, this does not mean that a specification of "Content-type: text/us-ascii" is optional. In the absence of such a header field, it is impossible to be certain that a message is actually text in the US-ASCII character set, because it might well be a message that, using the conventions that predate this memo, includes non-textual data in a manner that cannot be automatically recognized (e.g. a uuencoded compressed UNIX tar file). Although there is no acceptable alternative to treating such untyped messages as "text/us-ascii", implementors should remain aware that unless explicitly so marked, they may in practice be almost anything.

It should be noted that the list of Content-type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially.
We have simply attempted, in this memo, to give as many standard Content-type definitions as was possible given the current state of our knowledge.

3 The Content-Transfer-Encoding Header Field

Many content-types that one might wish to transport via e-mail are represented, in their "natural" format, as 8-bit character or binary data. Such data cannot be properly transmitted over existing Internet mail mechanisms because both RFC 821 and RFC 822 restrict mail messages to 7-bit US-ASCII data with 1000-character lines.

It is necessary, therefore, to extend the definition of the data types allowed in the RFC 821 and RFC 822 framework, and to define a standard mechanism for encoding such data in an acceptable manner. This memo specifies that such encodings will be indicated by a new "Content-Transfer-Encoding" header field. The Content-Transfer-Encoding field is used to indicate the type of transformation that has been used to represent the message body in an acceptable manner.

It may seem that the Content-Transfer-Encoding could be inferred from the characteristics of the Content-Type that is to be encoded, or, at the very least, certain Content-Transfer-Encodings could be mandated for use with specific Content-Types. There are several reasons why this is not the case. First, given the varying types of transports used for mail, some encodings may be appropriate for some Content-Type/transport combinations and not for others. Second, certain Content-Types may require different types of transfer encoding under different circumstances. For example, many PostScript messages might consist entirely of short lines of 7-bit data and hence require little or no encoding. Other PostScript messages (especially those using Level 2 PostScript's binary encoding mechanism) may only be reasonably represented using a binary transport encoding.
Finally, since Content-Type is intended to be an open-ended specification mechanism, strict specification of an association between Content-Types and encodings effectively couples the specification of an application protocol with a specific lower-level transport. This is not desirable since the developers of a Content-Type should not have to be aware of all the transports in use and what their limitations are.

It should be noted, also, that there is considerable interest and effort being expended on extending mail transport to permit 8-bit or binary data. If such extensions ever become commonplace, the Content-Transfer-Encoding mechanism will quickly become irrelevant, and it is therefore desirable not to "overload" Content-Transfer-Encoding with additional mechanisms that might still be useful in such a future. For this reason, Content-Transfer-Encoding is restricted in its scope to refer to nothing but the 7-bit encoding question. Matters such as the basic format in which information is "encoded" are to be handled by other mechanisms.

Unlike Content-types, which are expected to proliferate, it is expected that there will never be more than a few different Content-Transfer-Encoding values, both because there is less need for variation and because the effect of variation in Content-Transfer-Encoding would be more problematic. However, establishing only a single Content-Transfer-Encoding mechanism does not seem possible. In particular, there is a tradeoff between the desire for a compact and efficient encoding of binary data and the desire for a readable encoding of data that is mostly, but not entirely, 7-bit data. For this reason, at least two encoding mechanisms are necessary, a "readable" encoding and a "dense" encoding. A third encoding, for compressed ("super-dense") data, is also viewed by many as desirable.
This memo does not specify a "compressed" encoding, due largely to the uncertain legal state of the UNIX "compress" command and a lack of certainty, during the drafting of this memo, regarding the right way to define a standard compression algorithm. It is hoped that a compressed Content-Transfer-Encoding will be defined in a future RFC. Any compression algorithm for such a use should be unambiguously defined and without legal encumbrances. (Alternate mechanisms for compression have also been proposed, and might be defined in ways that are compatible with this memo.)

The Content-Transfer-Encoding field is designed to specify an invertible mapping between the "native" representation of a type of data and a representation that can be readily exchanged using 7-bit mail transport protocols as defined by RFC 821 (SMTP). This field has not been defined by any previous RFC. The field's value is a single atom specifying the type of encoding, as enumerated below. Formally:

   Content-Transfer-Encoding := "BASE64" / "QUOTED-PRINTABLE" / "8BIT" / "BINARY" / "7BIT" / "X-" atom

These values are not case sensitive. That is, Base64 and BASE64 and bAsE64 are all equivalent. An encoding type of 7BIT implies that the message is already in a seven-bit mail-ready representation. This value is assumed if the Content-Transfer-Encoding header field is not present.

If the message is stored or transported via a mechanism that permits 8-bit data, a Content-Transfer-Encoding of "8bit" should be used. If the message is stored or transported via a mechanism that permits arbitrary binary data, a Content-Transfer-Encoding of "binary" may nonetheless be used. In particular, "8bit" or "binary" must be used in the case where there is a possibility that the message may "leak" into a more restricted (7-bit) transport environment.
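The grammar and the default rule above can be sketched as a small lookup routine. This is an illustration in Python only: the function name and the use of a dictionary keyed by lower-cased header field names are assumptions made for the example, not something the memo defines.

```python
STANDARD_ENCODINGS = {"base64", "quoted-printable", "8bit", "binary", "7bit"}


def transfer_encoding(headers):
    """Return the normalized Content-Transfer-Encoding for a header dict.

    `headers` is assumed to map lower-cased field names to raw values.
    """
    value = headers.get("content-transfer-encoding")
    if value is None:
        # 7BIT is assumed when the header field is absent.
        return "7bit"
    name = value.strip().lower()  # values are not case sensitive
    if name in STANDARD_ENCODINGS:
        return name
    if name.startswith("x-") and len(name) > 2:
        return name  # private, non-standard encoding
    raise ValueError("unrecognized Content-Transfer-Encoding: %r" % value)


if __name__ == "__main__":
    print(transfer_encoding({}))
    print(transfer_encoding({"content-transfer-encoding": "bAsE64"}))
```

Rejecting unknown values outright is a design choice of this sketch; a gateway might instead prefer to pass such a body through untouched.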
(DISCUSSION: The distinction between the Content-Transfer-Encoding values of "binary," "8bit," and "7bit" may seem unimportant in an 8-bit binary environment, but clear labeling will be of enormous value to gateways between 8-bit and 7-bit systems. The difference between "8bit" and "binary" is that "8bit" implies adherence to SMTP limits on line length and CR/LF semantics, whereas "binary" does not.)

Implementors may, if necessary, define new Content-Transfer-Encoding values, but should prefix them with "x-" to indicate their non-standard status, e.g. "Content-Transfer-Encoding: x-my-new-encoding". However, unlike Content-types and subtypes, the creation of new Content-Transfer-Encoding values is explicitly discouraged, as it seems likely to hinder interoperability with little potential benefit.

If a Content-Transfer-Encoding header field appears as part of a message header, it applies to the entire message body, whether or not that body is of type "multipart." If it is of type multipart, the encoding applies recursively to all of the encapsulated parts, including their encapsulated headers and the encapsulation boundaries. If a Content-Transfer-Encoding header field appears as part of an encapsulation's headers, it applies only to the body of the encapsulated part. If the encapsulated part is itself of type "multipart", the encoding applies recursively to all of the encapsulated parts within that encapsulated part.

It should be noted that, because email is character-oriented, the mechanisms described here are mechanisms for encoding arbitrary byte streams, not bit streams. If a bit stream is to be encoded via one of these mechanisms, it should first be converted to a byte stream using the network standard bit order ("big-endian"), in which the earlier bits in a stream become the higher-order bits in a byte. A bit stream not ending at an 8-bit boundary should be padded with zeroes.
This RFC does not provide a mechanism for noting the addition of such padding; this information could either be encoded into the data stream or noted in some additional header field.

The following sections will define the two standard encoding mechanisms.

3.1 Quoted-Printable Content-Transfer-Encoding

The Quoted-Printable encoding is intended to represent data that largely contains octets less than 127. It encodes the data in such a way that the resulting octets are both unlikely to be modified by mail transport, and, when read as ASCII text, are largely recognisable by humans. A message which is entirely ASCII may also be encoded in Quoted-Printable to ensure its survival when it is expected to traverse a character-translating gateway, such as those onto BITNET.

In this encoding, ASCII characters 33 (EXCLAMATION POINT) through 57 (DIGIT 9), inclusive, and 59 (SEMICOLON) through 126 (TILDE), inclusive, may be represented as themselves. All other characters, including characters 32 (SPACE), 58 (COLON), 127 (DEL), and all control characters, are to be represented as determined by the following rules:

Rule #1: Any 8-bit value may be represented by a ":" followed by a two-digit hexadecimal representation of the character's 8-bit value. Thus, for example, character 12 (control-L, or formfeed) can be represented by ":0C", and the colon character (58) itself can be represented by ":3A". Rule #1 is optional for characters 10 (control-J, or linefeed), 13 (control-M, or return), and 32 (SPACE) through 126 (TILDE), and is required for all other characters.

Rule #2: The literal colon character must itself be quoted by a colon (i.e., as "::") if Rule #1 is not used. Note that this is not ambiguous with regard to Rule #1, because ":" is not part of the hexadecimal alphabet.

Rule #3: A colon at the end of a line may be used to indicate a non-significant line break.
That is, if one needs to include a long line without line breaks, a message encoded with the quoted-printable encoding should include "soft" line breaks in which the line break is preceded by a colon. Thus if the "raw" form of the line is a single line that says:

   Now's the time for all men to come to the aid of their country.

This could be represented, in the quoted-printable encoding, as

   Now's the time :
   for all men to come:
    to the aid of their country.

This provides a mechanism with which long lines are encoded in such a way as to be restored by the user agent. The quoted-printable encoding REQUIRES that lines be broken so that they are no more than 78 characters long, using soft line breaks when necessary.

Rule #4: SPACE (32) characters may generally be represented as themselves, but should NOT be so represented at the end of a line, because some MTAs are known to remove "white space" from the end of a line. In such cases, the characters MUST be represented as in Rule #1 (as ":20") or as themselves, followed by a soft line break followed by a real line break. Of course, these characters may be so represented within a line as well, if this is desired, though this is less readable. Note that in decoding a quoted-printable message, any trailing white space on a line should be deleted, as it will necessarily have been added by intermediate transport agents.

Rule #5: A CR LF pair normally constitutes a line break and should be represented by a line break in the quoted-printable encoding if that is its meaning. Isolated CRs, LFs, and LF CR sequences must be represented using the :0D, :0A, and :0A:0D notations respectively. CR LF sequences that are not intended to represent a line break should be encoded as :0D:0A to reflect this usage. In other words, the concept "end of line" is represented, in the quoted-printable encoding, by CR LF, although this may be modified in local storage formats.
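Rules #1 through #5 can be sketched roughly as follows. This is an illustrative Python sketch of the colon-based scheme described in this draft, under several simplifying assumptions: the function names are hypothetical, every CR LF in the input is taken to be a real line break, a literal colon is always written as ":3A" (Rule #1) rather than "::", and the decoder expects well-formed input. Python's standard quopri module implements the later "="-based encoding, so it is not used here.

```python
def qp_encode(data: bytes, max_len: int = 78) -> str:
    """Encode bytes with the draft's colon-based quoted-printable rules."""
    out_lines = []
    for line in data.split(b"\r\n"):  # each CR LF is treated as a line break
        tokens = []
        for i, b in enumerate(line):
            is_last = i == len(line) - 1
            if 33 <= b <= 126 and b != 58:
                tokens.append(chr(b))          # represented as itself
            elif b == 32 and not is_last:
                tokens.append(" ")             # space is safe inside a line
            else:
                tokens.append(":%02X" % b)     # Rule #1: ":" plus two hex digits
        current = ""
        for tok in tokens:
            # Leave one column free for the soft-break colon (Rule #3).
            if len(current) + len(tok) > max_len - 1:
                out_lines.append(current + ":")
                current = ""
            current += tok
        out_lines.append(current)
    return "\r\n".join(out_lines)


def qp_decode(text: str) -> bytes:
    """Invert qp_encode; also accepts "::" for a literal colon (Rule #2)."""
    out = bytearray()
    lines = text.split("\r\n")
    for n, line in enumerate(lines):
        line = line.rstrip(" \t")  # trailing white space was added in transit
        soft_break = False
        i = 0
        while i < len(line):
            if line[i] != ":":
                out.append(ord(line[i]))
                i += 1
            elif i == len(line) - 1:
                soft_break = True              # colon at end of line
                i += 1
            elif line[i + 1] == ":":
                out.append(58)                 # "::" is a literal colon
                i += 2
            else:
                out.append(int(line[i + 1:i + 3], 16))
                i += 3
        if not soft_break and n < len(lines) - 1:
            out += b"\r\n"
    return bytes(out)


if __name__ == "__main__":
    sample = b"Now's the time for all men to come to the aid of their country."
    encoded = qp_encode(sample * 3)
    print(encoded)
    assert qp_decode(encoded) == sample * 3
```

A space that lands just before a soft line break is left as a space, since the colon that follows it keeps it from being the last character on the line.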
Literal occurrences of CR or LF that do not occur as CRLF or are not intended to represent end-of-line markers must be represented in hexadecimal.

Since the hyphen character ("-") is represented as itself in the Quoted-Printable encoding, the usual care must be taken, when encapsulating a quoted-printable encoded message or body part in a multipart message, to ensure that the encapsulation boundary does not appear anywhere in the message. See the definition of multipart messages later in this memo.

It should be noted that the quoted-printable encoding represents something of a compromise between readability and reliability in transport. Message bodies encoded with the quoted-printable encoding will work reliably over most mail gateways, but may not work perfectly over a few gateways, notably those involving translation into EBCDIC. (In theory, an EBCDIC gateway could decode a quoted-printable message and re-encode it using base64, but such gateways do not yet exist.) A higher level of confidence is offered by the base64 Content-Transfer-Encoding. For more information about how to ensure that messages are safe against the vagaries of mail gateways, see Appendix I.

3.2 Base64 Content-Transfer-Encoding

The Base64 Content-Transfer-Encoding is designed to represent arbitrary 8-bit data in a form that is not humanly readable. The encoding and decoding algorithms are simple, but the encoded data is only about 33 percent larger than the unencoded data. This encoding is based on the one used in Privacy Enhanced Mail applications, as defined in RFC 1113. The base64 encoding is adapted from RFC 1113, with two changes: base64 eliminates the "*" mechanism for embedded clear text and defines a new syntax for portable end-of-line markers, using the comma character.

A 66-character subset of International Alphabet IA5 is used, enabling 6 bits to be represented per printable character.
(The extra 65th and 66th characters "=" and "," are used to signify special processing functions.) This subset has the important property that it is represented identically in IA5 and ASCII, and all characters in the subset are part of the so-called invariant subset of EBCDIC. Other popular encodings such as the encoding used by the UUENCODE utility and the base85 encoding specified as part of Level 2 PostScript do not share these properties, and thus do not fulfill the portability requirements imposed on a binary transport encoding for mail.

The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating three 8-bit input groups; this is then treated as four concatenated 6-bit groups. When encoding a bit stream via the base64 encoding, the bit stream should be presumed to be ordered with the most-significant-bit first. That is, the first bit in the stream will be the high-order bit in the first byte, and the eighth bit will be the low-order bit in the first byte, and so on.

Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string. These characters, identified in Table 1, below, are selected so as to be universally representable, and the set excludes characters with particular significance to SMTP (e.g., ".", "CR", "LF") and to the encapsulation boundaries defined in this RFC (e.g., "-").
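The grouping step just described can be sketched as follows. This is a minimal Python illustration covering complete 24-bit groups only; the function names are hypothetical, and the handling of a short final group and of line breaks in the output, which the memo specifies after Table 1, is deliberately left out.

```python
# The 64-character alphabet of Table 1, indexed by 6-bit value.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_group(three: bytes) -> str:
    """Encode exactly three input bytes as four output characters."""
    if len(three) != 3:
        raise ValueError("a full input group is exactly three bytes")
    # Concatenate three 8-bit values, most significant bit first.
    group = (three[0] << 16) | (three[1] << 8) | three[2]
    # Read the 24 bits back as four 6-bit indexes, left to right.
    return "".join(ALPHABET[(group >> shift) & 0x3F] for shift in (18, 12, 6, 0))


def encode_full_groups(data: bytes) -> str:
    """Encode input whose length is a multiple of three bytes."""
    if len(data) % 3 != 0:
        raise ValueError("padding of a short final group is not shown here")
    return "".join(encode_group(data[i:i + 3]) for i in range(0, len(data), 3))


if __name__ == "__main__":
    print(encode_group(b"abc"))
    print(encode_full_groups(b"abcabc"))
```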
Table 1

   Value Encoding   Value Encoding   Value Encoding   Value Encoding
       0 A             17 R             34 i             51 z
       1 B             18 S             35 j             52 0
       2 C             19 T             36 k             53 1
       3 D             20 U             37 l             54 2
       4 E             21 V             38 m             55 3
       5 F             22 W             39 n             56 4
       6 G             23 X             40 o             57 5
       7 H             24 Y             41 p             58 6
       8 I             25 Z             42 q             59 7
       9 J             26 a             43 r             60 8
      10 K             27 b             44 s             61 9
      11 L             28 c             45 t             62 +
      12 M             29 d             46 u             63 /
      13 N             30 e             47 v
      14 O             31 f             48 w          (pad) =
      15 P             32 g             49 x          (eol) ,
      16 Q             33 h             50 y

The output stream (encoded bytes) must be represented in lines of no more than 76 characters each. All line breaks in the encoded version of the data should be ignored by decoding software.

Special processing is performed if fewer than 24 bits are available at the end of a message or encapsulated part of a message. A full encoding quantum is always completed at the end of a message. When fewer than 24 input bits are available in an input group, zero bits are added (on the right) to form an integral number of 6-bit groups. Output character positions which are not required to represent actual input data are set to the character "=". Since all canonically encoded output is an integral number of octets, only the following cases can arise:

(1) the final quantum of encoding input is an integral multiple of 24 bits; here, the final unit of encoded output will be an integral multiple of 4 characters with no "=" padding,

(2) the final quantum of encoding input is exactly 8 bits; here, the final unit of encoded output will be two characters followed by two "=" padding characters, or

(3) the final quantum of encoding input is exactly 16 bits; here, the final unit of encoded output will be three characters followed by one "=" padding character.

One addition is made to the RFC 1113 specification of this encoding: The comma character (",", ASCII 44) may be used to represent an "end-of-line" or "end-of-record" marker. If line-oriented data are encoded using base64, it is desirable to restore end-of-line markers according to the local convention.
The RFC 1113 specification, as given above, offers no way to differentiate between a binary file including a CRLF sequence and a portable end-of-line marker. This memo augments that mechanism to permit such differentiation, as follows. To represent an end-of-line marker:

1. Treat the byte stream preceding the end-of-line as terminating at the end of the line -- that is, pad with "=" characters as appropriate to complete the representation of the line.

2. Insert a comma character.

3. Resume the encoding, starting a new 24-bit input group with the first character on the next line.

Thus, while encoding the sequence "a-b-c-CR-LF-a-b-c" (or, in hexadecimal, "61 62 63 0D 0A 61 62 63") yields the octets which are represented in ASCII as "YWJjDQphYmM=", encoding "a-b-c" followed by an end-of-line followed by "a-b-c" (in hex, "61 62 63" end-of-line "61 62 63") yields "YWJj,YWJj". They will be translated back into the same thing if the local end-of-line convention is ASCII "CRLF" (hexadecimal "0D 0A"), but they will be translated back differently if the end-of-line convention is anything other than ASCII CRLF (hexadecimal "0D 0A").

(Note: The utility of the portable end-of-line feature is somewhat limited. Most line-oriented data are best represented by the quoted-printable encoding. A few cases, however, might benefit from this mechanism, notably line-oriented textual data in character sets that bear no resemblance to ASCII.)

Note: There is no need to worry about quoting apparent encapsulation boundaries within base64-encoded parts of multipart messages, because no hyphen characters are used in the base64 encoding.

4 Additional Optional Content- Header Fields

4.1 Optional Content-ID Header Field

In constructing a high-level user agent, it may be desirable to allow one message body-part to make reference to another.
This may be done using the "Content-ID" header field, which is syntactically identical to the "Message-ID" header field:

   Content-ID := "<" msg-id ">"

4.2 Optional Content-Description Header Field

It may be desirable to associate some descriptive information with a given body-part. For example, it may be useful to mark an "image" body-part as "a picture of the Space Shuttle Endeavour." Such text may be placed in the Content-Description header field.

   Content-Description := *text

5 The Predefined Content-type Values

This memo defines nine initial content-type values and an extension mechanism for private or experimental types. Further types must be defined and published by a new RFC. It is expected that most innovation in new types of mail will take place as subtypes of the nine types defined here.

5.1 The TEXT Content-type and the US-ASCII Character Set

The text content-type is intended for sending textual email. It is the default content-type. Subtype names are used, for text, to indicate character sets.

The default content-type for internet mail is "text", and the default subtype (character set) is "US-ASCII". Alternatively, a different character set subtype may be specified, in which case the body text is in the specified character set. A recommended list of predefined subtype names can be found at the end of this section. Note that if the specified character set includes 8-bit data, the Content-Transfer-Encoding header field is required in order to transmit the message via SMTP.

The default character set, US-ASCII, has been the subject of some confusion and ambiguity in the past. Not only were there some ambiguities in the definition, there have been wide variations in practice. In order to eliminate such ambiguity and variations in the future, it is strongly recommended that new user agents explicitly specify a character set via the content-type header field.
US-ASCII is not an arbitrary seven-bit character code, but indicates that the message body uses character coding that uses the exact correspondence of codes to characters specified in ASCII. National use variations of ISO646 [REF-ISO646] are not ASCII, and neither an explicit "ASCII" character set, nor "US-ASCII", nor the default (omission of a character set) should be used when characters are coded using them.

(Discussion: RFC821 very explicitly specifies "ASCII", and references an earlier version of the American Standard cited in [REF-ANSI]. Whether that specification, rather than a reference to an International Standard, was done deliberately or out of convenience or ignorance, is no longer interesting: insofar as one of the purposes of specifying a content-type and character set is to permit the receiver to unambiguously determine how the sender intended the coded message to be interpreted, assuming anything other than "strict ASCII" as the default would risk unintentional and incompatible changes to the semantics of messages now being transmitted. This also implies that messages containing characters coded according to national variations on ISO646, or using code-switching procedures (e.g., those of ISO2022), as well as 8-bit or multiple octet character encodings MUST use an appropriate character set specification to be consistent with this specification.)

The complete US-ASCII character set is listed in Appendix III. Note that the control characters (0-31) and delete (127) have no defined meaning apart from the combination <CR><LF> (ASCII values 13 and 10) indicating a new line. Two of the characters have de facto meanings in wide use: <FF> (ASCII 12) as the first character of a line means "start this line on the beginning of a new page"; and <TAB> (ASCII 9) means "move the cursor to the next available position 8n+1 after the next position".
Apart from this, any use of the control characters or DEL in a message must be part of a private agreement between the sender and recipient. Such private agreements are discouraged and should be replaced by the other capabilities of this memo.

Beyond US-ASCII, one can imagine an enormous proliferation of character sets. It is the opinion of the authors of this memo that a large number of character sets is NOT a good thing. We would prefer to specify a single character set that can be used universally for representing all of the world's languages in electronic mail. Unfortunately, there is no clear choice for such a universal representation, and existing practice in several communities seems to point to the continuing use of multiple character sets in the near future. For this reason, we define names for a small number of character sets for which a strong constituent base exists. We recommend the use of ISO-10646 wherever possible.

The defined subtypes of text, which name alternate character sets, are:

US-ASCII -- as defined above.
ISO-8859-X -- where "X" is to be replaced, as necessary, for the national use variants of ISO-8859 [REF-ISO-8859]. Note that the ISO-646 character sets have deliberately been omitted in favor of their 8859 replacements, which are the designated character sets for Internet mail. The ISO-8859 character sets will be rigorously defined, for use in mail and other applications, by a forthcoming RFC.

Note that the character set used should always be explicitly specified in the Content-type field.

The following three subtypes of text are expected to be defined by forthcoming documents. Their use is not recommended in advance of those publications:

ISO-10646 -- as defined in [REF-ISO-10646]. This standard is not, as of this writing, finalized, and therefore its use for email cannot be fully specified.
ISO-2022 -- as defined in [REF-ISO-2022]. ISO-2022 is problematic for mail use because it actually specifies ways of designating and accessing character sets, rather than, itself, being a character set. Its use in mail will probably be strongly desired by communities who are already using it locally to handle multiple sets of characters and multi-byte characters. It appears necessary to explicitly specify the ISO-2022 methods that will be permitted in text mail so as to avoid the need for private agreements about, e.g., the specific character sets being used in messages. It is expected that those interested in ISO-2022 mail will devise and publish such a specification in the future.
QUOTED-READABLE -- A format for representing text in multiple character sets, as defined in [REF-RFC-QR].

Implementors are discouraged from defining new character sets for mail use unless absolutely necessary.

The intent of "text" is to represent "unformatted" text in an appropriate character set. Formatted text, such as multi-font text, should use the "text-plus" content-type.

5.2 The "Multipart" Content-Type

In the case of multiple part messages, a "multipart" Content-type field should appear in the RFC 822 message header. The message body is then assumed to contain multiple parts separated by encapsulation boundaries. Each of the parts is defined, syntactically, as a header area, a blank line, and a body area, similar to the RFC 822 syntax for a message. However, body parts are NOT to be interpreted as actually being RFC 822 messages. To begin with, NO header fields are actually required in body parts. A body part that starts with a blank line, therefore, is a body part for which all default values are to be assumed. In such a case, of course, the absence of a Content-type header field implies that the encapsulation is US-ASCII text.
The only header fields that have defined meaning for body parts are those the names of which begin with "Content-". All other header fields are generally to be ignored in body parts, and may be discarded by gateways. They are permitted to appear in body parts only for ease of conversion between messages and body parts. Of course, "X-" fields may be created for experimental or private purposes, with the recognition that the information they contain may be lost at some gateways.

It must be understood that body parts are NOT messages. For example, a gateway between Internet and X.400 mail must be able to tell the difference between a body part that consists of an image and a body part that consists of an encapsulated message, the body of which is an image. In order to represent the latter, the body part should have "Content-type: message", and its body (after the blank line) should be the encapsulated message, with its own "Content-type: image" header field. Body parts use the same syntax as messages because there are many legitimate cases in which a body part might be converted into a message, or vice versa. The identical syntax makes such conversions easy, but the distinction must be understood by implementors. (For the special case in which all parts are actually messages, a "digest" subtype is also defined.)

As stated previously, each pair of consecutive body parts is separated by an encapsulation boundary. The encapsulation boundary MUST NOT appear inside any of the encapsulated parts. Thus, it is crucial that the composing agent be able to choose and specify the boundary that will separate the parts.

The Content-type field for multipart messages requires two supplementary fields. The first is used to specify a version number and must be either "1-S" or "1-P". The two versions have identical syntax, but the "-P" is intended as a hint, to receivers, that the parts are intended to be viewed in parallel rather than sequentially.
Implementations that cannot show the parts in parallel, or that choose not to do so, are free to treat all multipart messages of version "1-P" as if they were version "1-S". However, all implementations must check the version number, to ensure graceful behavior in the event that an incompatible future version of multipart messages is defined later. Future version numbers will always start with an integer for the primary version number, followed by a hyphen and (possibly) some additional text.

The second parameter, which is always required for multipart messages, is used to specify the format of the encapsulation boundary. The encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the second parameter of the Content-type header field with any leading or trailing white space removed.

(DISCUSSION: The specification that white space be removed is intended to eliminate the possible introduction of ambiguity caused by the addition or deletion of white space by message transport agents. The hyphens are for rough compatibility with the earlier RFC 934 method of message encapsulation, and for ease of searching for the boundaries in some implementations. However, it should be noted that multipart messages are NOT completely compatible with RFC 934 encapsulations; in particular, they do not obey RFC 934 quoting conventions for embedded lines that begin with hyphens.)
Thus, a typical multipart content-type header field might look like this:

   Content-type: multipart; 1-S; gc0p4Jq0M2Yt08jU534c0p

This indicates that the message consists of several parts, each itself structured as an RFC 822 message, which are intended to be viewed one-at-a-time, and that the parts are separated by the line

   --gc0p4Jq0M2Yt08jU534c0p

The encapsulation boundaries must not appear within the encapsulations, and should be no longer than 70 characters, not counting the two leading hyphens.

The encapsulation boundary following the last body part should be a distinguished delimiter that indicates that no further body parts will follow. Such a delimiter is identical to the previous delimiters, with the addition of two more hyphens at the end of the line:

   --gc0p4Jq0M2Yt08jU534c0p--

It should be noted that there appears to be room for additional information prior to the first encapsulation boundary and following the final such boundary. For several reasons, however, it is specified that these areas should be left blank, and that implementations should ignore anything that appears before the first boundary or after the last one. (The most important reasons are the lack of proper typing of these parts and lack of clear semantics for handling these parts at gateways, particularly X.400 gateways.)

The use of "Content-Type: Multipart" as a message part within another "Content-Type: Multipart" is explicitly allowed. In such cases, for obvious reasons, care must be taken to ensure that each nested multipart message uses a different boundary delimiter. See Appendix II for an example of nested multipart messages.

The use of content-type "Multipart" with only a single included part may be useful in certain contexts, and is explicitly permitted.
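As a rough illustration of the delimiter rules in this section, the following C sketch classifies one line of a multipart body as ordinary text, an encapsulation boundary, or the closing delimiter. The names classify_line and line_kind are hypothetical and do not come from this memo. The sketch assumes that the trailing CRLF has already been removed from the line, and that the boundary parameter has already been extracted from the Content-type field and trimmed of white space.

```c
#include <string.h>

/* Hypothetical names, for illustration only. */
enum line_kind { LINE_TEXT, LINE_BOUNDARY, LINE_CLOSE };

/* Classify one line of a multipart body.  `line` has no trailing CRLF;
 * `boundary` is the (already trimmed) boundary parameter. */
enum line_kind classify_line(const char *line, const char *boundary)
{
    size_t n = strlen(boundary);

    if (strncmp(line, "--", 2) != 0)
        return LINE_TEXT;              /* does not start with two hyphens */
    if (strncmp(line + 2, boundary, n) != 0)
        return LINE_TEXT;              /* hyphens, but a different string */
    if (line[2 + n] == '\0')
        return LINE_BOUNDARY;          /* "--" boundary and nothing else */
    if (strcmp(line + 2 + n, "--") == 0)
        return LINE_CLOSE;             /* "--" boundary "--" */
    return LINE_TEXT;
}
```

A receiving agent might call such a function on every line of the body, ignoring everything before the first LINE_BOUNDARY and everything after the LINE_CLOSE.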
Overall, the body of a multipart message may be specified as follows:

   body := 1*encapsulation close-delimiter
   encapsulation := delimiter CRLF message
   delimiter := "--" <delimiter from Content-type resource>
   close-delimiter := delimiter "--"
   message := <as defined in RFC 822, with all header fields optional, and with the specified delimiter not occurring anywhere in the body, either on a line by itself or as a substring anywhere.>

The above description defines the default subtype of the multipart type, "mixed", which may be explicitly specified with a content-type of "multipart/mixed". Other subtypes are possible, but should be defined to be syntactically compatible with the "mixed" subtype. Unrecognized subtypes should be treated as being of subtype "mixed."

(DISCUSSION: Conspicuously missing from the multipart type is a notion of structure. In general, it seems premature to try to standardize structure yet. It is recommended that those wishing to provide a more structured or integrated multipart messaging facility should define a subtype of multipart that is syntactically identical, but that always expects the inclusion of a distinguished part (e.g. with a content-type of "Application/x-my-structure-subtype") that can be used to specify the structure and integration of the other parts, probably referring to them by their Content-ID field. If this approach is used, other implementations will not recognize the subtype, but will treat it as the default subtype (multipart/mixed) and will thus be able to show the user the parts that are recognized.)

This memo defines one particular subtype of multipart, the "digest" subtype. This type is syntactically identical to multipart, but the semantics are different. In particular, in a digest, all of the parts are assumed to be of type "Message". That is, each part is implicitly prefixed by a line that says "Content-type: message" followed by a blank line.
This is provided in order to allow a more readable digest format that is largely compatible (except for the quoting convention) with RFC 934.

5.3 The "Text-Plus" Content-Type and "Richtext" subtype

There are many formats for representing what might be known as "extended text" -- text with embedded formatting and presentation information. An interesting characteristic of most such representations is that they are to some extent readable even without the software that interprets them. It is useful, then, to distinguish them, at the highest level, from such non-readable data as images or audio messages. In the absence of appropriate interpreting software, it is reasonable to show extended text to the user, while it is not reasonable to do so with binary data.

To represent such data, this memo defines a "text-plus" content-type. Plausible subtypes of text-plus are typically given by the common name of the representation format, e.g. "text-plus/Troff" or "text-plus/TeX". Character sets are not specified as subtypes; in general it is assumed that rich text formats will have their own mechanisms for representing alternate or multiple character sets. However, a subtype can be defined to permit such a specification, e.g. "text-plus/troff; charset=ISO-8859-1". Initial subtypes include troff, TeX, and PostScript.

In order to promote the wider interoperability of simple formatted text, this memo defines an extremely simple subtype of "text-plus", the "richtext" subtype. This subtype was designed to meet the following criteria:

1. The syntax is extremely simple to parse, so that even teletype-oriented mail systems can easily strip away the formatting information and leave only the readable text.

2. The syntax is easily extended to allow for new formatting commands that are deemed essential.

3. The capabilities are extremely limited, to ensure that it can represent no more than is likely to be representable by the user's primary word processor. While this limits what can be sent, it increases the likelihood that it can be properly displayed.

4. The syntax is compatible with SGML, so that, with an appropriate DTD (Document Type Definition, the standard mechanism for defining a document type using SGML), a general SGML parser could be made to parse richtext. (However, richtext is several orders of magnitude simpler than full SGML, and no SGML knowledge is required in order to understand the richtext specification.)

The syntax of "richtext" is very simple. It is assumed, at the top level, to be in the US-ASCII character set. All characters represent themselves, with the exception of the "<" character (ASCII 60), which is used to begin a formatting sequence. Formatting sequences consist of formatting commands surrounded by angle brackets ("<>", ASCII 60 and 62). Each formatting command may be no more than 40 characters in length. Formatting commands that begin with a forward slash or solidus (ASCII 47) are negations, and such negations must always exist to balance the initial opening commands. Thus, if the formatting sequence "<bold>" appears at some point, there must later be a "</bold>" to balance it. There are only two exceptions to this "balancing" rule: First, the command "<lt>" may be used to represent a literal "<" character. Second, the command "<nl>" may be used to represent a line break. (NOTE: These are intended to be mnemonic: "lt" stands for "less than", and "nl" stands for "new line".)

Initially defined formatting commands are:

Bold -- causes the subsequent text to be in a bold font.
Italic -- causes the subsequent text to be in an italic font.
Fixed -- causes the subsequent text to be in a fixed width font.
Smaller -- causes the subsequent text to be in a smaller font.
Bigger -- causes the subsequent text to be in a bigger font.
Underline -- causes the subsequent text to be underlined.
Center -- causes the subsequent text to be centered.
FlushLeft -- causes the subsequent text to be left justified.
FlushRight -- causes the subsequent text to be right justified.
Indent -- causes the subsequent text to be indented at both margins.
Subscript -- causes the subsequent text to be interpreted as a subscript.
Superscript -- causes the subsequent text to be interpreted as a superscript.
ISO-10646 -- causes the subsequent text to be interpreted as text in the ISO-10646 character set.
ISO-8859-X (for any registered value of X) -- causes the subsequent text to be interpreted as text in the appropriate character set.
US-ASCII -- causes the subsequent text to be interpreted as text in the US-ASCII character set. Although this is the default character set, it might be usefully nested inside another character set.
Excerpt -- causes the subsequent text to be interpreted as a textual excerpt from another message. Typically this will be displayed using indentation and an alternate font, but such decisions are up to the viewer.
Comment -- causes the subsequent text to be interpreted as a comment, and hence not shown to the reader. (Comments may be used, among other things, for annotating richtext documents with information that will be useful upon translation into some richer document format.)
No-op -- has no effect on the subsequent text.

Each formatting command affects all subsequent text until the matching </token>. Such pairs of tokens must be properly balanced. Thus, the proper way to describe text in bold italics is:

   <bold><italic>the-text</italic></bold>

and, in particular, the following is illegal richtext:

   <bold><italic>the-text</bold></italic>

Implementations should regard any unrecognized formatting token as equivalent to "No-op", thus facilitating future extensions to "richtext".
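The balancing rule can be checked mechanically. The following C sketch is one possible way to do so; the function name richtext_balanced and the nesting limit MAX_DEPTH are assumptions made for this illustration and are not part of the richtext definition. It applies the rule strictly, treating unrecognized commands like any other command, exempting only "lt" and "nl", and comparing command names without regard to case.

```c
#include <ctype.h>
#include <string.h>

#define MAX_TOKEN 40   /* longest formatting command the memo allows */
#define MAX_DEPTH 32   /* arbitrary nesting limit chosen for this sketch */

/* Return 1 if every formatting command in `s` is closed in the right
 * order, 0 otherwise.  "lt" and "nl" are exempt from balancing. */
int richtext_balanced(const char *s)
{
    char stack[MAX_DEPTH][MAX_TOKEN + 1];
    char token[MAX_TOKEN + 1];
    int depth = 0;

    while (*s != '\0') {
        size_t len = 0;

        if (*s++ != '<')
            continue;                       /* ordinary text */
        while (*s != '\0' && *s != '>') {
            if (len == MAX_TOKEN)
                return 0;                   /* command too long */
            token[len++] = (char)tolower((unsigned char)*s++);
        }
        if (*s == '\0')
            return 0;                       /* missing '>' */
        s++;                                /* skip the '>' */
        token[len] = '\0';

        if (strcmp(token, "lt") == 0 || strcmp(token, "nl") == 0)
            continue;
        if (token[0] == '/') {
            if (depth == 0 || strcmp(stack[depth - 1], token + 1) != 0)
                return 0;                   /* unbalanced or crossed */
            depth--;
        } else {
            if (depth == MAX_DEPTH)
                return 0;                   /* too deep for this sketch */
            strcpy(stack[depth++], token);
        }
    }
    return depth == 0;
}
```

With this sketch, the bold-italic example given above is accepted and the crossed example is rejected.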
Private extensions may be defined using formatting tokens that begin with "X-", by analogy to Internet mail headers.

Richtext also differentiates between "hard" and "soft" line breaks. A line break (CR LF) in the richtext data stream is interpreted as a "soft" line break, one that is included only for purposes of mail transport, and is to be treated as white space by richtext interpreters. To include a "hard" line break (one that should be displayed as such), the "<nl>" formatting token should be used.

Putting all this together, the following "text-plus/richtext" body fragment:

   <bold>Now</bold> is the time for <italic>all</italic> good men
   <smaller>(and <lt>women>)</smaller> to come <ignoreme> to the aid of
   their <nl> beloved <nl><nl>country. <comment> Stupid quote!
   </comment> -- the end

represents the following formatted text (which will, no doubt, look cryptic in the text-only version of this memo):

   Now is the time for all good men (and <women>) to come to the aid of their
   beloved

   country. -- the end

A minimal richtext implementation is one that simply converts "<lt>" to "<", converts CRLFs to SPACE, converts <nl> to a newline according to local newline convention, removes everything between a <comment> token and the next following </comment> token, and removes all other formatting tokens (all text enclosed in angle brackets).

NOTE ON THE RELATIONSHIP OF RICHTEXT TO SGML: Richtext is decidedly not SGML, and should not be used to transport arbitrary SGML documents. Those who wish to use SGML document types as a mail transport format should define a new text-plus subtype, e.g. "text-plus/sgml-dtd-whatever". Richtext is designed to be compatible with SGML, and specifically so that it will be possible to define a richtext DTD if that is desired.
However, this does not imply that arbitrary SGML can be called richtext, nor that richtext implementors have any need to understand SGML; the description in this memo is a complete definition of richtext.

One of the major goals in the design of richtext is to make it so simple that even text-only mailers would implement richtext-to-plain-text translators, thus increasing the likelihood that multifont text will become "safe" to use very widely. To demonstrate this simplicity, what follows is a short C program that converts richtext input into plain text output:

   #include <stdio.h>
   #include <string.h>
   #include <ctype.h>

   /* Read a formatting token, lowercased, up to the closing '>'.
      Returns EOF if the input ends first. */
   static int gettoken(char *token, int size)
   {
       int c, i = 0;

       while ((c = getc(stdin)) != EOF && c != '>') {
           if (i < size - 1)   /* silently truncate overlong tokens */
               token[i++] = isupper(c) ? tolower(c) : c;
       }
       token[i] = '\0';
       return c;
   }

   int main(void)
   {
       int c;
       char token[50];

       while ((c = getc(stdin)) != EOF) {
           if (c == '<') {
               if (gettoken(token, sizeof token) == EOF)
                   break;
               if (!strcmp(token, "lt")) {
                   putc('<', stdout);
               } else if (!strcmp(token, "nl")) {
                   putc('\n', stdout);
               } else if (!strcmp(token, "comment")) {
                   /* Skip everything up to the matching </comment> */
                   while (strcmp(token, "/comment")) {
                       while ((c = getc(stdin)) != EOF && c != '<')
                           ;
                       if (c == EOF || gettoken(token, sizeof token) == EOF)
                           break;
                   }
               }
               /* Ignore all other tokens */
           } else {
               /* A soft line break is treated as white space */
               putc(c == '\n' ? ' ' : c, stdout);
           }
       }
       putc('\n', stdout); /* for good measure */
       return 0;
   }

5.4 The Message Content-Type

It is frequently desirable, in sending mail, to encapsulate another mail message. For this common operation, a special content-type, "message", is hereby defined. A content-type of "message" with no subtype indicates that the body or body part is an encapsulated message, with the syntax of an RFC 822 message, as extended by this memo.

The special subtype "pem" may be used to indicate that the body or body part is a message conforming to the Privacy Enhanced Mail protocol [RFC-1113].

The special subtype "partial" may be used to indicate that the body or body part is a fragment of a larger message.
Three subfields must be specified in the content-type field: The first is a unique identifier, as close to a world-unique identifier as possible, to be used to match the parts together. (In general, the identifier can be similar to a message-id; if placed in double quotes, it can be any message-id, in accordance with the BNF for "parameter" given earlier in this memo.) The second, an integer, is the part number. The third, another integer, is the total number of parts. Thus, part 2 of a 3-part message might have the following header field:

   Content-type: Message/Partial; "oc=jpbe0M2Yt4s@thumper.bellcore.com"; 2; 3

When the parts of a message broken up in this manner are put together, the result is a complete RFC-822 format message, which may have its own Content-type header field, and thus may contain any other data type.

(EXPLANATION: The purpose of the MESSAGE/PARTIAL type is to allow large objects to be delivered as several separate pieces of mail and automatically reassembled by the receiving user agent. This may be desirable when intermediate transport agents limit the size of messages that can be sent.)

Additionally, all the character set subtypes of text are defined as subtypes of "message." If a character set subtype is given, it applies to the bodies, though not the names, of each of the encapsulated message's header fields except the Content-XXX header fields, which must be entirely in US-ASCII. Thus it can be used to represent address and subject information in non-ASCII character sets. The character set subtype does NOT apply to the body of the encapsulated message. Thus, to encapsulate a message with non-ASCII characters in both the header fields and in the body, you would need something like the following:

   From: <ASCII form>
   Subject: <ASCII form>
   Content-type: message/iso-8859-2

   From: <iso-8859-2-form>
   Subject: <iso-8859-2-form>
   Content-type: text/iso-8859-2

   Message body in iso-8859-2 character set.
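The reassembly of "partial" messages described earlier in this section can be sketched in C as follows. This is only an illustration under stated assumptions: the structure partial and the function reassemble are hypothetical names, the three subfields are assumed to have been parsed already, and each fragment body is assumed to be held in memory as a NUL-terminated string. The memo itself does not prescribe any particular data structure or algorithm.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one message/partial body. */
struct partial {
    const char *id;      /* unique identifier from the Content-type field */
    int number;          /* part number, counted from 1 */
    int total;           /* total number of parts */
    const char *body;    /* body of this fragment */
};

/* Concatenate the fragments that carry identifier `id`, in part-number
 * order.  Returns a malloc'd string that the caller must free, or NULL
 * if a part is missing or the totals disagree. */
char *reassemble(const struct partial *parts, size_t count, const char *id)
{
    int total = 0, n;
    size_t i, len = 0;
    char *out;

    for (i = 0; i < count; i++) {
        if (strcmp(parts[i].id, id) != 0)
            continue;                       /* belongs to another message */
        if (total == 0)
            total = parts[i].total;
        else if (parts[i].total != total)
            return NULL;                    /* inconsistent totals */
        len += strlen(parts[i].body);
    }
    if (total <= 0)
        return NULL;                        /* no fragments with this id */

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    out[0] = '\0';
    for (n = 1; n <= total; n++) {
        const struct partial *found = NULL;
        for (i = 0; i < count; i++)
            if (parts[i].number == n && strcmp(parts[i].id, id) == 0)
                found = &parts[i];
        if (found == NULL) {                /* part n has not arrived */
            free(out);
            return NULL;
        }
        strcat(out, found->body);
    }
    return out;
}
```

The reassembled string would then be parsed as a complete RFC 822 message in its own right, since it may carry its own Content-type header field.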
5.5 The Binary Content-Type

A content-type of "binary" may be used to indicate that the body or body part is binary data. A subtype may be specified, but none are defined here. The parameters for type binary are a set of attribute/value pairs, of the form "NAME=VALUE", separated by the usual semicolons. The set of possible attributes to be defined includes, but is not limited to:

NAME -- a suggested name for the binary data as a file.
TYPE -- the type of binary data.
CONVERSIONS -- the set of operations that have been performed on the data before putting it in the mail (and before any Content-Transfer-Encoding that might have been applied). If multiple conversions have occurred, they should be specified in the order they were applied, and separated by commas.

The values for these attributes are left undefined at present, but may require specification in the future. An example of a common (though discouraged) usage might be:

   Content-type: binary; name=foo.tar.Z.uu; type=tar; "conversions=compress,uuencode"

However, the use of such mechanisms as uuencode and compress is explicitly discouraged, in favor of the Content-Transfer-Encoding mechanism, which is both more standardized and more portable across mail boundaries.

The recommended action for an implementation that receives binary mail of an unrecognized type is to simply offer to put the data in a file, with any Content-Transfer-Encoding undone, or perhaps to use it as input to a user-specified process.

To reduce the danger of transmitting rogue programs through the mail, it is strongly recommended that implementations NOT implement a path-search mechanism whereby an arbitrary program named in the Content-type header field (e.g. the "type=" subfield of a binary content-type) is found and executed using the mail body as input.
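The attribute/value syntax described in this section can be illustrated with a small lookup routine. The C sketch below is an assumption-laden illustration, not a complete parser: the name get_param is hypothetical, attribute names are compared without regard to case, and only two forms are handled, namely an unquoted "NAME=VALUE" and a parameter enclosed as a whole in double quotes with no embedded semicolons. Values that are themselves quoted are not handled.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters. */
static int same_prefix(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Look up attribute `name` in `params`, the text that follows the
 * content-type, e.g. "name=foo.tar; type=tar".  Copies the value into
 * `out` and returns 1, or returns 0 if it is absent or does not fit. */
int get_param(const char *params, const char *name, char *out, size_t outsize)
{
    size_t nlen = strlen(name);
    const char *p = params;

    while (*p != '\0') {
        const char *end;

        while (*p == ';' || isspace((unsigned char)*p))
            p++;
        if (*p == '"')
            p++;                         /* whole parameter is quoted */
        end = p;
        while (*end != '\0' && *end != ';' && *end != '"')
            end++;
        if ((size_t)(end - p) > nlen && same_prefix(p, name, nlen)
            && p[nlen] == '=') {
            size_t vlen = (size_t)(end - p) - nlen - 1;

            /* Drop white space between the value and the separator. */
            while (vlen > 0 && isspace((unsigned char)p[nlen + vlen]))
                vlen--;
            if (vlen >= outsize)
                return 0;
            memcpy(out, p + nlen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p = end;
        if (*p == '"')
            p++;                         /* skip the closing quote */
    }
    return 0;
}
```

Note that such a routine only extracts the value. In keeping with the warning above, an implementation should not treat the result of looking up "type" as the name of a program to run.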
The recommended action for an implementation that receives binary mail of an unrecognized type is to simply decode any Content-Transfer-Encoding and put the data in a file for the end-user.

Among the subtypes that have been suggested as suitable subtypes of "binary" are such document representation formats as "DVI" and "ODA".

5.6 The Application Content-Type Value

The "application" content-type is to be used for mail-based applications. A mail-based application is an application that defines a standard format for representing intermediate data that is to be manipulated by cooperating user agents. For example, a meeting scheduler might define a standard representation for information about proposed meeting dates. An intelligent user agent would use this information to conduct a dialog with the user, and might then send further mail based on that dialog. Such applications may be defined as subtypes of the "application" content-type.

There is no default subtype for application, and this memo defines only one subtype, the "external-reference" subtype.

The External-Reference subtype indicates that the body or body part is not included, presumably because too much data is involved for the underlying mail transport mechanism to handle. The subfields are, as in the case of the "binary" content-type, attribute-value pairs. In this case, the subfields describe a mechanism for accessing the external binary data. The set of possible attributes includes, but is not limited to:

FILENAME -- The name of a file that contains the external data.
SITE -- one or more domain names, comma separated, of machines that are known to have access to the data file. Asterisks may be used for wildcard matching to a part of a domain name, such as "*.bellcore.com", while a single asterisk may be used to indicate a file that is expected to be universally available, e.g. via a global file system.
REAL-TYPE -- The real content-type of the data, once retrieved.
EXPIRATION -- The date (in RFC 822 date syntax) after which the existence of the external data is not guaranteed.

With the emerging possibility of very wide-area file systems, it becomes very hard to know in advance the set of machines where a file will and will not be accessible directly from the file system. Therefore it makes sense to provide both a file name, to be tried directly, and the name of one or more sites from which the file is known to be accessible. An implementation can try to retrieve remote files using FTP or any other protocol, using anonymous file retrieval or prompting the user for the necessary name and password.

However, the external-reference mechanism is not intended to be limited to file retrieval. One can imagine, for example, using a LISTSERV mechanism, or using unique identifiers and a video server for external references to video clips. However, this memo explicitly defines only the FILENAME and SITE attributes for retrieval purposes, as this is the only retrieval method that is currently widely applicable. Other attributes may be defined as needed.

The "REAL-TYPE" attribute may be used to specify a new content-type header field to be applied to the data once retrieved, as the data are assumed to be only the body of a message, not including any header information.

Note that because of the syntax of parameters, they may be quoted by enclosing an entire parameter in double quotes. Thus an external reference to an image in G3FAX format might have the following content-type header field:

   Content-Type: application/external-reference;
        name=/usr/local/images/contact.g3;
        site=thumper.bellcore.com;
        real-type=image/g3fax;
        expiration="Fri, 14 Jun 1991 19:13:14 -0400 (EDT)"

If a message is of content-type "application/external-reference", then the actual body of the message is ignored.
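The wildcard rule for the SITE attribute might be implemented along the following lines. This C sketch is an interpretation rather than a definition: the function names are hypothetical, an asterisk is honored only when it is the first character of a pattern (as in "*.bellcore.com") or the whole pattern, and domain names are compared without regard to case. The memo does not say whether asterisks may appear elsewhere in a pattern.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if `host` matches one SITE pattern of length `plen`:
 * "*" matches any host, "*.suffix" matches any host ending in
 * ".suffix", and anything else must match exactly. */
static int site_pattern_matches(const char *pat, size_t plen, const char *host)
{
    size_t hlen = strlen(host), i;

    if (plen == 1 && pat[0] == '*')
        return 1;                          /* universally available */
    if (plen >= 2 && pat[0] == '*') {      /* leading wildcard */
        pat++;
        plen--;
        if (hlen < plen)
            return 0;
        host += hlen - plen;               /* compare the tail only */
        hlen = plen;
    }
    if (hlen != plen)
        return 0;
    for (i = 0; i < plen; i++)
        if (tolower((unsigned char)pat[i]) != tolower((unsigned char)host[i]))
            return 0;
    return 1;
}

/* Return 1 if `host` matches any entry of the comma-separated SITE value. */
int site_matches(const char *site, const char *host)
{
    while (*site != '\0') {
        size_t len;

        while (*site == ',' || isspace((unsigned char)*site))
            site++;
        len = strcspn(site, ",");
        while (len > 0 && isspace((unsigned char)site[len - 1]))
            len--;                         /* trim trailing white space */
        if (len > 0 && site_pattern_matches(site, len, host))
            return 1;
        site += strcspn(site, ",");
    }
    return 0;
}
```

A user agent could use such a test to decide whether to try the file name directly on the local machine before falling back to retrieval from one of the named sites.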
The distinction between the application and binary content-types is more a difference of intent than of syntax. The application content-type is used to indicate data that are intended to be interpreted by a mail-based user agent of some sort. The binary content-type is intended for the transport of arbitrary binary data, typically data that are used independently of a mail system, and for which mail transport is used as a convenient alternative to other file and data transport mechanisms.

5.7 The Audio, Image, and Video Content-Type Values

This memo defines several more content-type values that are defined only incompletely here, and await further practical experience before their values can be more completely specified. These are clearly experimental in nature, and are partially defined here in order to encourage experimenters to move in a common direction, recognizing that future additional standardization will be needed.

AUDIO -- Indicates that the body or body part contains audio data. The subtype specifies the audio representation format. It is expected that such subtypes will be defined by future standards. In the meantime, vendor formats may be marked by subtypes such as "audio/x-sun", "audio/x-next", and "audio/x-mac".

IMAGE -- Indicates that the body or body part contains an image. The subtype names the specific image format. A few such case-insensitive values are "G3Fax" for Group Three Fax, "jpeg" for the JPEG format, and "pbm", "pgm", and "ppm" for the "portable bitmap" formats for black and white, grey scale, or color images.

VIDEO -- Indicates that the body or body part contains a time-varying-picture image, possibly with color and coordinated sound. The term "video" is used extremely generically, rather than with reference to any particular technology or format, and is not meant to preclude subtypes such as animated drawings encoded compactly.
The subtype and possible parameter values are left undefined by this memo.

5.8 Experimental ("X-") Content-Type Values

A content-type value beginning with the characters "X-" and not defined here or in another RFC is a private value, to be used by consenting mail systems by mutual agreement. Any format without a rigorous and public definition should be named with an "X-" prefix. Older versions of the widely-used Andrew system use the "X-BE2" name, so new systems should probably choose a different name.

6 Conformance With this Memo

The mechanisms described in this memo are open-ended. It is definitely not expected that all implementations will implement all of the content-types described, nor that they will all share the same extensions. In order to promote interoperability, however, it is useful to define the concept of "XXXX-Conformance" to define a certain level of implementation that allows the useful interworking of messages with content that differs from US ASCII text. In this section, we specify the requirements for such conformance.

An XXXX-conformant mail user agent must:

1. Recognize the Content-Transfer-Encoding header field, and decode data encoded with either the quoted-printable or base64 implementations. (If a compressed encoding is ever agreed to, it should also become part of all conformant user agents.)

2. Recognize and interpret the Content-type header field, and avoid showing an unsuspecting user raw data that has a content-type field other than text.

3. Explicitly handle the following content-type values, to at least the following extents:

Text:
   -- Recognize and display "text" mail with the subtype "US-ASCII."
   -- Recognize other subtypes at least to the extent of being able to inform the user about what character set the message uses.
   -- Recognize the "ISO-8859-1" subtype to the extent of being able to display those characters that are common to ISO-8859-1 and US-ASCII.
   -- Never compose text mail without including a "Content-type" header specifying the appropriate subtype (character set).

Text-plus:
   -- For unrecognized subtypes, show or offer to show the user the "raw" version of the data. An ability to convert "text-plus/richtext" to plain text is encouraged, but not required for conformance.

Message:
   -- Recognize and display at least the default (simple) encapsulation.

Multipart:
   -- Recognize and display the default (mixed) subtype, although parallel parts may be serialized.
   -- Treat any unrecognized subtypes as if they were "mixed".

Binary:
   -- Offer the ability to remove any Content-Transfer-Encoding and put the resulting information in a user file.

4. Upon encountering any unrecognized content-type, an implementation should treat it as if it had a content-type of "binary" with no parameter sub-arguments. How such data is handled is up to an implementation, but likely options for handling such unrecognized data include offering to write it into a file (decoded from its mail transport format) or offering to let the user name a program to which the decoded data should be passed as input. Unrecognized predefined types, which might include audio, image, video, or application, should also be treated in this way.

A user agent that meets the above conditions is said to be XXXX-conformant. The meaning of this phrase is that it is assumed to be "safe" to send virtually any kind of properly-marked data to users of such mail systems, because they will at least be able to treat the data as undifferentiated binary, and will not simply splash it onto the screen of unsuspecting users. Of course, there is another sense in which it is always "safe" to send XXXX-conformant format data, which is that such data will not break or be broken by any known systems that are conformant with RFC 821 and RFC 822.
User agents that are XXXX-conformant have the additional guarantee that the user will not be shown data that were never intended to be viewed as text.

Appendix I -- Guidelines For Sending Data Via Email

Internet email is not yet a perfect, homogeneous system. Mail may become corrupted at several stages in its travel to a final destination. Specifically, email sent throughout the Internet may travel across many networking technologies. Many networking and mail technologies do not support the full functionality possible in the SMTP transport environment. Mail traversing these systems is likely to be modified in such a way that it can be transported.

There exist many widely deployed non-conformant MTA's in the Internet. These MTA's, speaking the SMTP protocol, alter messages on the fly to take advantage of internal data structure of the hosts they are implemented on, or are just plain broken.

The following guidelines may be useful to anyone devising a data format (content-type) that will survive the widest range of networking technologies and known broken MTA's unscathed. Note that anything encoded in the base64 encoding will satisfy these rules, but that some well-known mechanisms, notably the UNIX uuencode facility, will not. Note also that anything encoded in the Quoted-Printable encoding will survive most gateways intact, but possibly not gateways to systems that use the EBCDIC character set.

(1) Delimiters other than CR-LF pairs may be used in the local representation of a message on some systems. The persistence of CR-LF pairs should not be relied on.

(2) Isolated CR and LF characters are not well tolerated in general; they may be lost or converted to delimiters on some systems, and hence should not be relied on.

(3) TAB characters may be misinterpreted or may be automatically converted to variable numbers of spaces. This is unavoidable in some environments, notably those not based on the ASCII character set.
Such conversion is STRONGLY DISCOURAGED, but it may occur, and users of US-ASCII format should not rely on the persistence of TAB characters.

(4) Lines longer than 78 characters may be wrapped or truncated in some environments. Line wrapping and line truncation are STRONGLY DISCOURAGED, but unavoidable in some cases. Applications which depend on lines not being wrapped should use mechanisms other than unencoded US-ASCII bodyparts to transmit messages.

(5) Trailing "white space" characters (SPACE, TAB, etc.) on a line may be discarded by some transport agents, while other transport agents may pad lines with these characters so that all lines in a mail file are of equal length. The persistence of trailing white space, therefore, should not be relied on.

(6) Many mail domains use variations on the ASCII character set, or use character sets such as EBCDIC which contain most but not all of the US-ASCII characters. The correct translation of characters not in the "invariant" set cannot be depended on across character converting gateways. For example, this situation is a problem when sending uuencoded information across BITNET, an EBCDIC system. Similar problems can occur without crossing a gateway, since many Internet hosts use character sets other than ASCII internally. In particular, the only characters that are known to be consistent across all gateways are the 62 characters that correspond to the upper and lower case letters A-Z and a-z, the 10 digits 0-9, and the following eleven special characters:

    "'" (ASCII code 39)
    "(" (ASCII code 40)
    ")" (ASCII code 41)
    "+" (ASCII code 43)
    "," (ASCII code 44)
    "-" (ASCII code 45)
    "." (ASCII code 46)
    "/" (ASCII code 47)
    ":" (ASCII code 58)
    "=" (ASCII code 61)
    "?" (ASCII code 63)

A maximally portable mail representation, such as the base64 encoding, will confine itself to relatively short lines of text in which the only meaningful characters are taken from this set of 73 characters.
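The character set in guideline (6) and the line length in guideline (4) can be turned into a small checker. The following Python sketch is only an illustration: the names INVARIANT, non_portable_chars and is_maximally_portable are hypothetical, and the 78-character limit is borrowed from guideline (4) as an assumption, since the text above only says "relatively short lines".

```python
import string

# The eleven special characters listed in guideline (6).
SPECIALS = "'()+,-./:=?"

# 52 letters + 10 digits + 11 specials = the 73 "invariant" characters.
# SPACE is deliberately absent: the list above does not include it.
INVARIANT = set(string.ascii_letters + string.digits + SPECIALS)


def non_portable_chars(line):
    """Return the sorted characters of a line that fall outside the set."""
    return sorted({ch for ch in line if ch not in INVARIANT})


def is_maximally_portable(text, max_line_length=78):
    """True if every line is short and uses only invariant characters.

    max_line_length=78 is an assumption taken from guideline (4).
    """
    return all(
        len(line) <= max_line_length and not non_portable_chars(line)
        for line in text.splitlines()
    )


print(len(INVARIANT))
print(is_maximally_portable("aGVsbG8gd29ybGQ="))
print(non_portable_chars("begin 644 file_name"))
```

The base64-style line passes because base64 uses only letters, digits, "+", "/" and "=", all of which are in the set. The third call flags the space and the underscore.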
Please note that the above list is NOT a list of recommended practices for MTA's. RFC 821 MTA's are prohibited from altering the character of white space, or wrapping long lines. These BAD and illegal practices are known to occur on established networks, and implementations should be robust in dealing with the bad effects they can cause.

Appendix II -- A Complex Multipart Example

What follows is the outline of a complex multipart message. This message has three parts to be displayed serially: an introductory plain text part, an embedded multipart message, and a closing encapsulated text message in a non-ASCII character set. The embedded multipart message has two parts to be displayed in parallel, a picture and an audio fragment.

    From: ...
    Subject: ...
    Content-type: multipart; 1-s; tweedledum

    This is a multipart message. Since I've not specified
    another character set, this "prefix" area is in US ASCII.

    --tweedledum

    ...Some more text appears here...
    [Note that the preceding blank line means no header
    fields were given and this is text, with charset US ASCII.]

    --tweedledum
    Content-type: multipart; 1-p; tweedledee

    This is a multipart message. If you are reading this text,
    you might want to consider changing to a user agent that
    understands how to properly display multipart messages.

    --tweedledee
    Content-type: x-NeXT
    Content-Transfer-Encoding: base64

    ... base64-encoded NeXT-format audio data goes here....

    --tweedledee
    Content-type: image/G3FAX
    Content-Transfer-Encoding: Base64

    ... base64-encoded FAX data goes here....

    --tweedledee--
    --tweedledum
    Content-type: message/ISO-8859-1

    From: (name in ISO-8859-1)
    Subject: (subject in ISO-8859-1)
    Content-type: Text/ISO-8859-1
    Content-Transfer-Encoding: Quoted-printable

    ... Closing text in ISO-8859-1 goes here ...
    --tweedledum--

Appendix III -- The US-ASCII Character Set

The following table explicitly defines the default character set for Internet mail, "US-ASCII":

         0  nul                   @   64  Commercial at
         1  soh                   A   65  Latin capital letter a
         2  stx                   B   66  Latin capital letter b
         3  etx                   C   67  Latin capital letter c
         4  eot                   D   68  Latin capital letter d
         5  enq                   E   69  Latin capital letter e
         6  ack                   F   70  Latin capital letter f
         7  bel                   G   71  Latin capital letter g
         8  bs                    H   72  Latin capital letter h
         9  ht                    I   73  Latin capital letter i
        10  lf                    J   74  Latin capital letter j
        11  vt                    K   75  Latin capital letter k
        12  np                    L   76  Latin capital letter l
        13  cr                    M   77  Latin capital letter m
        14  so                    N   78  Latin capital letter n
        15  si                    O   79  Latin capital letter o
        16  dle                   P   80  Latin capital letter p
        17  dc1                   Q   81  Latin capital letter q
        18  dc2                   R   82  Latin capital letter r
        19  dc3                   S   83  Latin capital letter s
        20  dc4                   T   84  Latin capital letter t
        21  nak                   U   85  Latin capital letter u
        22  syn                   V   86  Latin capital letter v
        23  etb                   W   87  Latin capital letter w
        24  can                   X   88  Latin capital letter x
        25  em                    Y   89  Latin capital letter y
        26  sub                   Z   90  Latin capital letter z
        27  esc                   [   91  Left square bracket
        28  fs                    \   92  Reverse solidus
        29  gs                    ]   93  Right square bracket
        30  rs                    ^   94  Circumflex accent
        31  us                    _   95  Low line
        32  Space                 `   96  Grave accent
    !   33  Exclamation mark      a   97  Latin small letter a
    "   34  Quotation mark        b   98  Latin small letter b
    #   35  Number sign           c   99  Latin small letter c
    $   36  Dollar sign           d  100  Latin small letter d
    %   37  Percent sign          e  101  Latin small letter e
    &   38  Ampersand             f  102  Latin small letter f
    '   39  Apostrophe            g  103  Latin small letter g
    (   40  Left parenthesis      h  104  Latin small letter h
    )   41  Right parenthesis     i  105  Latin small letter i
    *   42  Asterisk              j  106  Latin small letter j
    +   43  Plus sign             k  107  Latin small letter k
    ,   44  Comma                 l  108  Latin small letter l
    -   45  Hyphen, minus sign    m  109  Latin small letter m
    .   46  Full stop             n  110  Latin small letter n
    /   47  Solidus               o  111  Latin small letter o
    0   48  Digit zero            p  112  Latin small letter p
    1   49  Digit one             q  113  Latin small letter q
    2   50  Digit two             r  114  Latin small letter r
    3   51  Digit three           s  115  Latin small letter s
    4   52  Digit four            t  116  Latin small letter t
    5   53  Digit five            u  117  Latin small letter u
    6   54  Digit six             v  118  Latin small letter v
    7   55  Digit seven           w  119  Latin small letter w
    8   56  Digit eight           x  120  Latin small letter x
    9   57  Digit nine            y  121  Latin small letter y
    :   58  Colon                 z  122  Latin small letter z
    ;   59  Semicolon             {  123  Left curly bracket
    <   60  Less-than sign        |  124  Vertical line
    =   61  Equals sign           }  125  Right curly bracket
    >   62  Greater-than sign     ~  126  Tilde
    ?   63  Question mark            127  Del

Summary

Using the Content-Type and Content-Transfer-Encoding header fields, it is possible to include, in a standardized way, arbitrary types of data objects with RFC 822 conformant mail messages. No restrictions imposed by either RFC 821 or RFC 822 are broken, and care has been taken to avoid problems caused by additional restrictions imposed by the characteristics of some Internet mail transport mechanisms (see Appendix I). The "multipart" and "message" content-types allow mixing and hierarchical structuring of objects of different types in a single message. Further content-types allow a standardized mechanism for tagging messages or message parts as audio, image, or several other kinds of data. Additional optional header fields provide conventional mechanisms for certain extensions deemed desirable by many implementors. Finally, a number of useful content-types are defined for general use by consenting user agents.
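To make the interplay of the two header fields concrete, here is a minimal Python sketch of how a user agent might decode a body part and then decide what to do with it. It is a simplification under stated assumptions: the function names decode_body and handle_part and the "display" / "save-to-file" labels are hypothetical, and the multipart and message types, which a conformant agent must also handle, are left out. Anything that is not text is treated as undifferentiated binary.

```python
import base64
import quopri


def decode_body(encoded, transfer_encoding):
    """Undo the Content-Transfer-Encoding of a body given as bytes."""
    name = (transfer_encoding or "").strip().lower()
    if name == "base64":
        return base64.b64decode(encoded)
    if name == "quoted-printable":
        return quopri.decodestring(encoded)
    # No encoding named, or one this sketch does not know: pass through.
    return encoded


def handle_part(content_type, transfer_encoding, encoded):
    """Return an (action, data) pair for one body part.

    A missing Content-Type is taken to mean text, as in the Appendix II
    example. Every non-text type is saved rather than shown, so raw data
    is never splashed onto the user's screen.
    """
    data = decode_body(encoded, transfer_encoding)
    major = (content_type or "text").split(";", 1)[0].split("/", 1)[0]
    if major.strip().lower() == "text":
        return ("display", data)
    return ("save-to-file", data)


print(handle_part("Text/ISO-8859-1", "Quoted-printable", b"caf=E9"))
print(handle_part("image/G3FAX", "Base64", b"aGVsbG8="))
print(handle_part(None, None, b"plain text"))
```

The type and encoding names are compared case-insensitively because the Appendix II example itself mixes "base64" and "Base64".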
Contacts

For more information, the authors of this document may be contacted via Internet mail:

    Nathaniel Borenstein <nsb@thumper.bellcore.com>
    Ned Freed <ned@innosoft.com>

Acknowledgements

This memo is the result of the collective effort of a large number of people, at several IETF meetings and on the IETF-SMTP and IETF-822 mailing lists. Although any enumeration seems doomed to suffer from egregious omissions, the following are among the many contributors to this effort: Harald Alvestrand, Randall Atkinson, Kevin Carosso, Mark Crispin, Dave Crocker, Terry Crowley, Walt Daniels, Frank Dawson, Hitoshi Doi, Kevin Donnelly, Johnny Eriksson, Craig Everhart, Roger Fajman, Alain Fontaine, Philip Gladstone, Thomas Gordon, Phill Gross, David Herron, Bruce Howard, Bill Janssen, Risto Kankkunen, Phil Karn, Tim Kehres, Neil Katin, Steve Kille, Anders Klemets, John Klensin, Valdis Kletniek, Stev Knowles, Bob Kummerfeld, Vincent Lau, Timo Lehtinen, John MacMillan, Rick McGowan, Leo Mclaughlin, Goli Montaser-Kohsari, Keith Moore, Mark Needleman, John Noerenberg, Mats Ohrman, David J. Pepper, Jonathan Rosenberg, Jan Rynning, Mark Sherman, Keld Simonsen, Bob Smart, Einar Stefferud, Michael Stein, Peter Svanberg, Steve Uhler, Stuart Vance, Erik van der Poel, Peter Vanderbilt, Greg Vaudreuil, Brian Wideen, Glenn Wright, and David Zimmerman. The authors apologize for any omissions from this list, which were certainly unintentional.

References

[REF-ISO646] International Standard--Information Processing--ISO 7-bit coded character set for information interchange, ISO 646:1983.

[REF-ISO-2022] International Standard--Information Processing--ISO 7-bit and 8-bit coded character sets--Code extension techniques, ISO 2022:1986.

[REF-ANSI] Coded Character Set--7-Bit American Standard Code for Information Interchange, ANSI X3.4-1986.
[REF-X400] Schicker, Pietro, "Message Handling Systems, X.400", Message Handling Systems and Distributed Applications, E. Stefferud, O-j. Jacobsen, and P. Schicker, eds., North-Holland, 1989, pp. 3-41.

[RFC-821] Postel, J.B. Simple Mail Transfer Protocol. August, 1982, Network Information Center, RFC-821.

[RFC-822] Crocker, D. Standard for the format of ARPA Internet text messages. August, 1982, Network Information Center, RFC-822.

[RFC-934] Rose, M.T.; Stefferud, E.A. Proposed standard for message encapsulation. January, 1985, Network Information Center, RFC-934.

[RFC-1049] Sirbu, M.A. Content-type header field for Internet messages. March, 1988, Network Information Center, RFC-1049.

[RFC-1113] Linn, J. Privacy enhancement for Internet electronic mail: Part I - message encipherment and authentication procedures [Draft]. August, 1989, Network Information Center, RFC-1113.

[RFC-1154] Robinson, D.; Ullmann, R. Encoding header field for internet messages. April, 1990, Network Information Center, RFC-1154.

[REF-RFC-QR] Prindeville, Phillipe-Andrew', and Keld Simonsen, "A Portable, Extensible Message Encoding Format for Alphabetic Scripts", Internet RFC, in preparation.

[REF-ISO-10646] Draft International Standard -- Information Technology -- Universal Coded Character Set (UCS), ISO/IEC DIS 10646:1990.

[REF-ISO-8859] **********

Table of Contents

    1    Introduction
    2    The Content-Type Header Field
    3    The Content-Transfer-Encoding Header Field
    3.1  Quoted-Printable Content-Transfer-Encoding
    3.2  Base64 Content-Transfer-Encoding
    4    Additional Optional Content- Header Fields
    4.1  Optional Content-ID Header Field
    4.2  Optional Content-Description Header Field
    5    The Predefined Content-type Values
    5.1  The TEXT Content-type and the US-ASCII Character Set
    5.2  The "Multipart" Content-Type
    5.3  The "Text-Plus" Content-Type and "Richtext" subtype
    5.4  The Message Content-Type
    5.5  The Binary Content-Type
    5.6  The Application Content-Type Value
    5.7  The Audio, Image, and Video Content-Type Values
    5.8  Experimental ("X-") Content-Type Values
    6    Conformance With this Memo
         Appendix I -- Guidelines For Sending Data Via Email
         Appendix II -- A Complex Multipart Example
         Appendix III -- The US-ASCII Character Set
         Summary
         Contacts
         Acknowledgements
         References