fdc@WATSUN.CC.COLUMBIA.EDU (Frank da Cruz) (03/27/89)
We are looking for information on any standards -- corporate, de-facto, national, international, or lack of any of these -- for storage (as opposed to transmission) of textual data that contains a mixture of alphabets, for example, Roman, Hebrew, Arabic, Cyrillic, Greek, Japanese, Chinese, Korean, Cherokee, ...

Commonly-used computer alphabets today include the well-known 7-bit US ASCII and its "national" variations (UK ASCII, ISO-646 with various national characters substituted for ASCII brackets, etc), the ISO 8859 family of 8-bit alphabets (Latin 1-5, Cyrillic, Hebrew, Arabic, Greek, etc), the several Japanese alphabets (JIS X 0201, JIS X 0208, etc), and so on.

For transmission of text composed of more than one alphabet, we convert from local storage conventions to the international standard alphabets (e.g. ISO or JIS) and then use the mechanisms and escape sequences defined in ISO 4873 and ISO 2022 (or JIS X 0202) for switching between them.

But for storing mixed-alphabet text within a computer file, what do we have? We have the "corporate standard" alphabets, such as the EBCDIC and ASCII "code pages" used on IBM mainframes and PCs, DEC Kanji, the Xerox character sets, the Macintosh character sets, and so on...

Does anyone know anything about "8-bit UNIX" -- the extension of UNIX to languages other than English? How about national versions of VAX/VMS, like French, German, or Hebrew VMS?

Is it true that most multi-language text files are those created by word processing programs, and are therefore in special proprietary or private formats, which include not only mechanisms for alphabet switching, but also special effects like font selection, highlighting, page formatting, etc? What are some popular multi-language word processing programs (for the PC, PS/2, Macintosh, etc), and what do their file formats look like? How difficult is it to separate the alphabet selection from the page formatting?

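To make the transmission-side mechanism mentioned above concrete, here is a rough Python sketch of ISO 2022-style alphabet switching in an 8-bit environment. It is an illustration only, not the Kermit transfer syntax (which is still being worked out). The assumptions: ASCII stays in G0, each ISO 8859 right half is designated to G1 with the sequence ESC - F, and the final bytes (A for Latin-1, L for Cyrillic, F for Greek, H for Hebrew) are as we understand the ISO registrations and should be checked against the register. The function name encode_mixed and the table ALPHABETS are hypothetical.

```python
# Sketch only: switch among ISO 8859 right halves with ISO 2022 G1
# designations (ESC - F). Final bytes are assumed from the ISO register.
ESC = b"\x1b"

# (Python codec name, assumed ISO 2022 final byte for the 96-character set)
ALPHABETS = [
    ("iso8859_1", b"A"),  # Latin-1
    ("iso8859_5", b"L"),  # Latin/Cyrillic
    ("iso8859_7", b"F"),  # Latin/Greek
    ("iso8859_8", b"H"),  # Latin/Hebrew
]


def encode_mixed(text):
    """Encode text as ASCII plus escape-switched ISO 8859 right halves."""
    out = bytearray()
    current = None  # final byte of the set currently designated to G1
    for ch in text:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")  # ASCII needs no switching
            continue
        # Try the set already designated first, to avoid needless switches.
        candidates = sorted(ALPHABETS, key=lambda a: a[1] != current)
        for codec, final in candidates:
            try:
                byte = ch.encode(codec)
            except UnicodeEncodeError:
                continue
            if final != current:
                out += ESC + b"-" + final  # designate a new set to G1
                current = final
            out += byte
            break
        else:
            raise ValueError("no alphabet in the table covers %r" % ch)
    return bytes(out)


if __name__ == "__main__":
    # Latin-1, Cyrillic, and Greek characters in one string.
    print(encode_mixed("caf\u00e9 \u0414\u0430 \u03b1"))
```

A stored file in a proprietary format would have to be mapped onto something like this before transmission, which is why the questions above about how alphabet selection is entangled with page formatting matter.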
This query is connected with an effort to extend the Kermit file transfer protocol to include a transfer syntax for multi-language text. This transfer syntax will probably wind up using the ISO 4873 and 2022 mechanisms for switching among ISO 8859 alphabets, with similar mechanisms applied to Japanese and other multi-byte character sets. Meanwhile, real-world examples of multi-language file formats are needed to test the proposed (and evolving) Kermit file transfer syntax against.

Please respond to any of the following addresses:

  cmg@watsun.cc.columbia.edu    KERMIT@CUVMA.BITNET
  fdc@watsun.cc.columbia.edu    FDCCU@CUVMA.BITNET

If you are interested in participating in the ensuing discussion, also ask to be added to the "isokermit" mailing list.

Thanks for your help!

Christine Gianone             Frank da Cruz
cmg@watsun.cc.columbia.edu    fdc@watsun.cc.columbia.edu
KERMIT@CUVMA.BITNET           FDCCU@CUVMA.BITNET