[comp.protocols.misc] Multi-alphabet text files

fdc@WATSUN.CC.COLUMBIA.EDU (Frank da Cruz) (03/27/89)

We are looking for information on any standards -- corporate, de facto,
national, or international -- or on the absence of such standards, for storage
(as opposed to transmission) of textual data that contains a mixture of
alphabets, for example, Roman, Hebrew, Arabic, Cyrillic, Greek, Japanese,
Chinese, Korean, Cherokee, ...

Commonly used computer alphabets today include the well-known 7-bit US ASCII
and its "national" variations (UK ASCII, ISO 646 with various national
characters substituted for ASCII brackets, etc.), the ISO 8859 family of 8-bit
alphabets (Latin 1-5, Cyrillic, Hebrew, Arabic, Greek, etc.), the several
Japanese alphabets (JIS X 0201, JIS X 0208, etc.), and so on.
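To make the storage problem concrete, here is a minimal sketch (in Python,
chosen purely for illustration) of how one stored 8-bit value reads as a
different letter under each ISO 8859 alphabet.  The helper name `readings` and
the choice of byte 0xE4 are assumptions made for this example; the codec names
are those shipped with Python.

```python
# Sketch: the same stored byte is a different letter in each ISO 8859 alphabet,
# so a file of raw 8-bit codes is ambiguous unless the alphabet is recorded.
ISO_8859_CODECS = ("iso8859_1", "iso8859_5", "iso8859_7", "iso8859_8")

def readings(byte_value):
    """Return {codec name: character} for a single byte value (hypothetical helper)."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in ISO_8859_CODECS}

if __name__ == "__main__":
    for codec, char in readings(0xE4).items():
        # Print the Unicode code point so the output is unambiguous on any terminal.
        print(f"{codec}: U+{ord(char):04X}")
```

The byte by itself says nothing about which of these readings the writer
intended; that has to come from somewhere outside the character codes.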

For transmission of text composed of more than one alphabet, we convert from
local storage conventions to the international standard alphabets (e.g. ISO
or JIS) and then use the mechanisms and escape sequences defined in ISO 4873
and ISO 2022 (or JIS X 0202) for switching between them.
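For readers unfamiliar with this style of switching, the following is a small
sketch using Python's built-in "iso2022_jp" codec, a 7-bit encoding that
follows the ISO 2022 approach of announcing each character set with an escape
sequence.  The function name and the sample string are assumptions made for
illustration; nothing here is part of Kermit.

```python
# Sketch: ISO 2022-style alphabet switching, shown with Python's built-in
# "iso2022_jp" codec.  Escape sequences designate the character set in use.
ESC = b"\x1b"
TO_JIS_X_0208 = ESC + b"$B"   # designate JIS X 0208 (two bytes per character)
TO_ASCII = ESC + b"(B"        # designate US ASCII

def encode_for_transmission(text):
    """Encode mixed Roman/Japanese text as a 7-bit byte stream (hypothetical helper)."""
    return text.encode("iso2022_jp")

# Hypothetical sample: the word "Kermit" followed by three kanji.
sample = "Kermit \u65e5\u672c\u8a9e"
wire = encode_for_transmission(sample)

if __name__ == "__main__":
    print(wire)                                 # escape sequences are visible in the bytes
    print(wire.decode("iso2022_jp") == sample)  # the receiver recovers the original text
```

The Roman text passes through unchanged, the switch to Japanese is marked by
the first escape sequence, and the stream returns to ASCII at the end.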

But for storing mixed-alphabet text within a computer file, what do we have?
We have the "corporate standard" alphabets, such as the EBCDIC and ASCII "code
pages" used on IBM mainframes and PCs, DEC Kanji, the Xerox character sets,
the Macintosh character sets, and so on...  Does anyone know anything about
"8-bit UNIX" -- the extension of UNIX to languages other than English?  How
about national versions of VAX/VMS, like French, German, or Hebrew VMS?

Is it true that most multi-language text files are created by word processing
programs, and are therefore in proprietary or private formats that include
not only mechanisms for alphabet switching, but also special effects like
font selection, highlighting, and page formatting?  What are some popular
multi-language word processing programs (for the PC, PS/2, Macintosh, etc.),
and what do their file formats look like?  How difficult is it to separate
the alphabet selection from the page formatting?

This query is connected with an effort to extend the Kermit file transfer
protocol to include a transfer syntax for multi-language text.  This transfer
syntax will probably wind up using the ISO 4873 and 2022 mechanisms for
switching among ISO 8859 alphabets, with similar mechanisms applied to
Japanese and other multi-byte character sets.  Meanwhile, we need real-world
examples of multi-language file formats against which to test the proposed
(and evolving) Kermit file transfer syntax.
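To show the general shape of such a scheme, here is a rough sketch of
switching among ISO 8859 alphabets with ISO 2022 designating sequences.  It is
NOT the Kermit transfer syntax, which is still being worked out.  It assumes
that each alphabet's right half is designated as the G1 set with "ESC - F",
where F is the final character registered for that set, and that G1 occupies
the upper half of the 8-bit code as in ISO 4873.  The function names, the
segment representation, and the sample text are all invented for the example.

```python
# Sketch only: a hypothetical transfer encoding, not the Kermit syntax.
ESC = b"\x1b"

# "ESC - F" designates a 96-character set as G1.  The final characters below
# are the ones registered for the right halves of these ISO 8859 parts.
G1_DESIGNATORS = {
    "iso8859_1": ESC + b"-A",   # Latin alphabet No. 1
    "iso8859_5": ESC + b"-L",   # Latin/Cyrillic
    "iso8859_7": ESC + b"-F",   # Latin/Greek
    "iso8859_8": ESC + b"-H",   # Latin/Hebrew
}
DESIGNATOR_TO_CODEC = {seq: name for name, seq in G1_DESIGNATORS.items()}

def encode_segments(segments):
    """Turn a list of (codec, text) pairs into one byte stream."""
    out = bytearray()
    current = None
    for codec, text in segments:
        if codec != current:            # announce the alphabet only when it changes
            out += G1_DESIGNATORS[codec]
            current = codec
        out += text.encode(codec)
    return bytes(out)

def decode_stream(data):
    """Inverse of encode_segments.  Assumes the text itself contains no ESC."""
    segments = []
    codec = "ascii"                     # assumed default before any designation
    buf = bytearray()
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in DESIGNATOR_TO_CODEC:
            if buf:                     # close the segment in the old alphabet
                segments.append((codec, buf.decode(codec)))
                buf = bytearray()
            codec = DESIGNATOR_TO_CODEC[seq]
            i += 3
        else:
            buf.append(data[i])
            i += 1
    if buf:
        segments.append((codec, buf.decode(codec)))
    return segments

# Hypothetical sample: French, Greek, and Russian fragments in one stream.
sample = [
    ("iso8859_1", "caf\u00e9 "),
    ("iso8859_7", "\u03b1\u03b2\u03b3 "),
    ("iso8859_5", "\u043c\u0438\u0440"),
]

if __name__ == "__main__":
    stream = encode_segments(sample)
    print(stream)
    print(decode_stream(stream) == sample)
```

A stored file format faces the same question this sketch answers with escape
sequences: where, and in what form, the alphabet of each run of text is
recorded.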

Please respond to any of the following addresses:

  cmg@watsun.cc.columbia.edu
  KERMIT@CUVMA.BITNET
  fdc@watsun.cc.columbia.edu
  FDCCU@CUVMA.BITNET

If you are interested in participating in the ensuing discussion, also ask
to be added to the "isokermit" mailing list.  Thanks for your help!

  Christine Gianone                   Frank da Cruz
  cmg@watsun.cc.columbia.edu          fdc@watsun.cc.columbia.edu
  KERMIT@CUVMA.BITNET                 FDCCU@CUVMA.BITNET