[comp.std.internat] A full solution

bas+@andrew.cmu.edu (Bruce Sherwood) (08/27/87)

I'm distressed by the nature of the new ISO Latin scheme (ISO 8859-1).  ISO
6937, which appeared some time ago, already covers nearly ALL languages
which use Roman-letter alphabets (with the exception of Vietnamese), whereas
the new ISO 8859 covers only some languages.  ISO 8859-1 seems a very major
step backwards.  The processing of non-English text in computer systems has
been plagued by one half-solution after another.  Just when things were
looking up (with ISO 6937), along comes a new and different standard which is
much more limited in scope.

ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96
characters.  About 30 of these are special characters not formable from
diacritics (e.g., Icelandic thorn, or undotted i).  There is a full set of
diacritics, which precede the letter they modify.  You can think of them as
non-spacing characters (so that the following letter prints on top of the
diacritic).  A better way to think of them however is as "alert" codes,
specifying that it and the following code form a 16-bit specification for a
character.  The actual dot pattern may be formed by superposition, or it may
be stored in a separate "rendering" set (to make a better-looking character
than could be produced by superimposing a letter and a separate diacritic).
The rest of the 96 extra characters are punctuation (such as inverted
exclamation and question for Spanish), some math symbols, etc.  In fact, the
first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit
characters of ISO 6937.
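
To make the "alert code" reading concrete, here is a minimal Python sketch.  It
is an illustration only, not text from either standard.  It assumes the
diacritic codes sit at byte values 193-207, as in the slot list posted later in
this thread, and that a diacritic byte is always followed by exactly one letter
byte.  The names DIACRITICS and parse_characters are made up for the example.

```python
# Sketch only: byte values for the diacritic row follow the slot list posted
# later in this thread (slots 193-207, with 201 unassigned).
DIACRITICS = {
    0xC1: "grave", 0xC2: "acute", 0xC3: "circumflex", 0xC4: "tilde",
    0xC5: "macron", 0xC6: "breve", 0xC7: "overdot", 0xC8: "diaeresis",
    0xCA: "overring", 0xCB: "cedilla", 0xCC: "underline",
    0xCD: "doubleacute", 0xCE: "ogonek", 0xCF: "hachek",
}


def parse_characters(data: bytes) -> list[int]:
    """Split a byte string into character codes.

    A diacritic byte acts as an "alert": it and the next byte are combined
    into one 16-bit code (diacritic in the high byte, letter in the low
    byte).  Every other byte stands alone as an 8-bit code.
    """
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in DIACRITICS:
            if i + 1 >= len(data):
                raise ValueError("diacritic code at end of data")
            chars.append((b << 8) | data[i + 1])
            i += 2
        else:
            chars.append(b)
            i += 1
    return chars


# The Spanish word "manana" with a tilde on the first n:
# m, a, tilde, n, a, n, a
text = b"ma" + bytes([0xC4]) + b"nana"
codes = parse_characters(text)
print(len(text), len(codes))   # 7 6
print(hex(codes[2]))           # 0xc46e
```

Seven bytes yield six characters; the third character is the 16-bit pair
(tilde, n).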

There is something exceedingly strange about ISO 8859-1.  Appendix A lists
countries rather than languages for which the standard is valid.  This is
awfully peculiar.  For example, Spain is in the list.  But Catalan is a very
important language in Spain, and in fact it is the language of the
technologically most developed part of the country (the region containing
Barcelona).  And it appears that ISO 8859-1 does not handle Catalan (dotted
L)!  And I note that the ligatured ij of Dutch is missing.  And the
"apostrophe-n" of Afrikaans.  And neither 8859-1 nor 8859-2 can handle
Esperanto (a language which I use a lot).  The ISO 6937 scheme handles all of
these languages.

Here is a quote from a discussion of ISO 8859 (Tim Lasko,
lasko@video.dec.com, DEC, writing in comp.std.internat):  "We (the U.S., ASC
X3L2) realized a bit too late that certain characters needed to properly
represent the Welsh language (w and y with circumflex) weren't conveniently
available in any of the ISO 8859 sets, and tried to change Part 4 to include
them.  However, there was neither room nor consensus within the ISO committee
to include these, so these too do not exist in any of the ISO 8859 code
tables.  (Arguably, the BSI should have been looking out for the requirements
of Welsh, but for a number of reasons that I choose not to go into here, they
did not.)"

This case of Welsh is another sad example of ISO 8859 catering to countries
rather than to languages...  And even in the face of the excellent work of
ISO 6937, which contains a listing of the diacritic needs for 41 languages,
including Welsh, which is listed as needing w and y with circumflex.  I can't
understand why the people working on 8859 didn't check their work against the
comprehensive list given in 6937.  The 41 languages covered by 6937 are
Afrikaans, Albanian, Basque, Breton, Catalan, Croat, Czech, Danish, Dutch,
English, Esperanto, Estonian, Faroese, Finnish, French, Frisian, Galician,
German, Greenlandic, Hungarian, Icelandic, Irish, Italian, Lapp, Latvian,
Lithuanian, Maltese, Norwegian, Occitan, Polish, Portuguese, Rhaeto-Romanic,
Romanian, Scots Gaelic, Slovak, Slovene, Sorbian, Spanish, Swedish, Turkish,
and Welsh.

It seems most unfortunate in this day of laser printers and fancy displays
and sophisticated window managers to implement yet another half solution, one
which is only sort of valid for some region of the globe, and even there is
valid only for "national" rather than regional languages.

The extensive multi-lingual Xerox scheme contains 6937 as one of the basic
sets.  The AT&T Videotex scheme is based on 6937.  The basic coding scheme in
PostScript is a subset of 6937 (it contains all of the 6937 diacritics, and
some of the 6937 special characters such as AE, in the same slots as 6937,
but it leaves many slots unused).  It may be that suddenly 6937 is out of
favor because it "didn't fully catch on," but it seems tragic to back off
from a full solution.

Perhaps you would be interested in what we plan to do in Base Environment 2
(BE2) of the Andrew system under development at the Information Technology
Center at Carnegie Mellon.  Much of the design is due to Tomas Centerlind of
Sweden, who worked here this summer.  Since we don't do Unix operating-system
development here, we feel that for now we have to stay with a 7-bit external
representation (on disk, in mail, etc.).  In the text datastream AE will be
represented by \.DigraphAE{}, and the Spanish n-tilde will be represented by
\.Tilde{n}.  In memory the AE in a BE2 document will be the ISO 6937 8-bit
code for AE.  The n-tilde will be represented in the document by the code
255, indicating that one must look in the accompanying environment tree (used
also for representing styles such as italic) for a 32-bit character code.
This "longchar" has the form 8/0, 8/0, 8/tilde, 8/n.  The upper bytes are for
expansion and indicate what character sets the lower two bytes refer to, and
the lower bytes are ISO 6937 for the diacritic and letter.  The reason for
putting the tilde-n out of line is to simplify various aspects of BE2 text
manipulation, and to make multi-byte characters nevertheless be accessed by
the programmer as single entities.
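
A rough sketch of the in-memory scheme just described, written in Python for
brevity.  This is not BE2 code, and none of the names below come from BE2.  It
assumes the four bytes of a "longchar" are packed most-significant first in the
order given above, that the tilde is byte 196 (0xC4) as in the ISO 6937 list,
and it uses a plain dictionary keyed by text position as a hypothetical
stand-in for the environment tree.

```python
LONGCHAR_MARK = 255   # placeholder byte in the in-memory text, per the post
TILDE = 0xC4          # ISO 6937 tilde, slot 196 (assumption: taken from the list)


def pack_longchar(diacritic: int, letter: int,
                  set_hi: int = 0, set_lo: int = 0) -> int:
    """Pack 8/set_hi, 8/set_lo, 8/diacritic, 8/letter into one 32-bit code."""
    parts = (set_hi, set_lo, diacritic, letter)
    if any(not 0 <= p <= 255 for p in parts):
        raise ValueError("each field must fit in 8 bits")
    code = 0
    for p in parts:
        code = (code << 8) | p
    return code


def unpack_longchar(code: int) -> tuple:
    """Return (set_hi, set_lo, diacritic, letter)."""
    return tuple((code >> shift) & 0xFF for shift in (24, 16, 8, 0))


def char_at(text: bytes, longchars: dict, index: int) -> int:
    """Fetch one character as a single entity, following the 255 marker."""
    b = text[index]
    return longchars[index] if b == LONGCHAR_MARK else b


# "a", n-tilde, "o": the n-tilde is stored out of line.
text = bytes([ord("a"), LONGCHAR_MARK, ord("o")])
longchars = {1: pack_longchar(TILDE, ord("n"))}  # hypothetical side table

print(hex(char_at(text, longchars, 1)))                   # 0xc46e
print(unpack_longchar(char_at(text, longchars, 1)))       # (0, 0, 196, 110)
```

Every position in the text holds exactly one byte, so indexing and cursor
arithmetic stay simple, and the caller still sees the n-tilde as one value.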

While editing, you can choose a system- or user-defined keyboard, with
associated key bindings.  You can have the keyboard displayed at the bottom
of the editing window and type with the mouse if you want.  Much of the
keyboard redefinition machinery has been built, but there are pieces of BE2
which have not yet been tweaked to make it all work.

Bruce Sherwood
Center for Design of Educational Computing
     and Information Technology Center
Carnegie Mellon University

hrs@homxb.UUCP (H.SILBIGER) (09/01/87)

> ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96
> characters.  About 30 of these are special characters not formable from
> diacritics (e.g., Icelandic thorn, or undotted i).  There is a full set of
> diacritics, which precede the letter they modify.  You can think of them as
> non-spacing characters (so that the following letter prints on top of the
> diacritic).  A better way to think of them however is as "alert" codes,
> specifying that it and the following code form a 16-bit specification for a
> character.  The actual dot pattern may be formed by superposition, or it may
> be stored in a separate "rendering" set (to make a better-looking character
> than could be produced by superimposing a letter and a separate diacritic).
> The rest of the 96 extra characters are punctuation (such as inverted
> exclamation and question for Spanish), some math symbols, etc.  In fact, the
> first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit
> characters of ISO 6937.
> 
ISO 6937 is emerging as the standard code set for communication.  ISO and CCITT
standards on Document Communication all specify this set.  The CCITT
equivalent is Recommendation T.61.

Herman Silbiger ...!ihnp4!homxb!hrs

frisk@askja.UUCP (Fridrik Skulason) (09/05/87)

On the subject of ISO 6937 versus ISO 8859.

6937 may be better than 8859 for some purposes (communication that is),
but as a standard character set for terminals it is useless.

The reason ...

Simple. Most existing software packages assume that (1 char in text =
1 char on screen). Adapting the software to work with a full 256 character
set instead of ASCII may be difficult, but it's still just a minor problem
compared to making 6937 work as a standard computer/terminal character set.
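
A small Python sketch of the mismatch.  It is illustrative only: the byte
values for the diacritic codes are taken from the ISO 6937 slot list posted
elsewhere in this thread, the function name is made up, and well-formed input
(every diacritic followed by a letter) is assumed.

```python
# Diacritic codes per the posted slot list: 193-207, with 201 unassigned.
DIACRITIC_CODES = frozenset(range(193, 208)) - {201}


def screen_columns(data: bytes) -> int:
    """Count display positions.

    A diacritic byte shares a column with the letter that follows it, so it
    does not take up a position of its own.
    """
    return sum(1 for b in data if b not in DIACRITIC_CODES)


line = b"ma" + bytes([196]) + b"nana"   # m, a, tilde+n, a, n, a
print(len(line))             # 7 bytes in the text
print(screen_columns(line))  # 6 columns on the screen
```

Software that uses the byte count as the column count would place the cursor
one position too far to the right on this line.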

Here in Iceland ISO 8859 is heavily used. It is the only useful standard
that supports all characters in our alphabet. (ISO 8859/1 - Western Europe that
is - the last time I looked ISO 8859/2 (or was that /3) - for Northern Europe
did not!)

It is available on some personal computers here (Amiga/Atari), on some
terminals (ADM in particular), and a couple of printers.

In fact - just a few days ago we decided that the University would not
buy or support a single piece of equipment that did not support ISO 8859/1.
 
(Meaning - in my case - that I have to work overtime to modify the
$#?!$#?!$!!! terminal emulation on the Macs. (I have already fixed the PCs))

So - in our case at least - 8859/1 is here to stay.


-- 
Fridrik Skulason  Univ. of Iceland, Computing Center
       UUCP  frisk@rhi                                      BIX  frisk

                     "This line intentionally left blank"

bas+@andrew.cmu.edu (Bruce Sherwood) (09/11/87)

A couple people wrote personally to me asking me to send them the ISO 6937
standard, or how to order it.

ISO 6937/2-1983 (E) can be ordered from 

American National Standards Institute
Department SD
1430 Broadway
New York NY 10018

I can't seem to find the cost, but I think it is about $35 for the paper
including shipping and handling ("Price based on 37 pages", according to the
cover sheet).  Expensive -- but I may be wrong about the exact price.

Here is the gist of ISO 6937.  It contains standard old ASCII in the slots 32
thru 126.  In the upper (8-bit) slots from 161 thru 254 we have the list
shown at the end of this note, divided into groups of 16 slots, with "---"
indicating "not assigned". 

The key features are a full set of diacritic codes and a full set of letters
used by roman-letter alphabets which aren't in base ASCII and can't be made
with diacritics.  Together these enable handling 41 different languages,
probably constituting almost all roman-letter scripts other than Vietnamese.
The codes in the column of diacritics function as escape codes, indicating
that it plus the following code constitute a 16-bit specification of a
complex character.  That complex character may be rendered by superimposing a
letter and a diacritic, or some implementations may choose to have a separate
"rendering" set of images in which the diacritic is already on the letter
(this gives higher-quality print possibilities, of course).

Note that altho the diacritic code precedes the associated letter code, a
decent computer system should allow the user to type a diacritic key AFTER
the letter.  Having to type it BEFORE is a bad holdover from mechanical
typewriters, which could handle diacritics only by implementing a "dead key"
which didn't advance the platen.  Linguistically however it makes no sense to
type the diacritic before typing the letter, and it should be the job of the
input routine to turn the bytes around in memory.
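
One way such an input routine might look, as a hedged Python sketch rather than
a description of any existing system.  It assumes keystrokes arrive as single
bytes, that the diacritic keys produce the codes 193-207 from the list in this
note, and that only plain ASCII letters take diacritics.  The function names
are made up for the example.

```python
# Diacritic codes per the slot list in this note: 193-207, 201 unassigned.
DIACRITIC_CODES = frozenset(range(193, 208)) - {201}


def is_ascii_letter(b: int) -> bool:
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


def store_keystrokes(keys: bytes) -> bytes:
    """Turn letter-then-diacritic typing into diacritic-then-letter storage."""
    stored = bytearray()
    for k in keys:
        if k not in DIACRITIC_CODES:
            stored.append(k)
            continue
        if not stored or not is_ascii_letter(stored[-1]):
            raise ValueError("diacritic typed with no letter before it")
        if len(stored) >= 2 and stored[-2] in DIACRITIC_CODES:
            raise ValueError("letter already carries a diacritic")
        # Slip the diacritic in front of the letter just typed.
        stored.insert(len(stored) - 1, k)
    return bytes(stored)


typed = b"man" + bytes([196]) + b"ana"      # user types n, then the tilde key
print(store_keystrokes(typed) == b"ma" + bytes([196]) + b"nana")  # True
```

The user types in the linguistically natural order, and the stored bytes still
follow the ISO 6937 rule that the diacritic precedes its letter.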

InvertedExclamationPoint
Cent
Pound
Dollar
Yen
---
Section
---
LeftSingleQuote
LeftDoubleQuote
LeftDoubleGuillemet
LeftArrow
UpArrow
RightArrow
DownArrow

Degree
Plus/Minus
SuperTwo
SuperThree
Multiply
Micro
Paragraph
CenteredDot
Divide
RightSingleQuote
RightDoubleQuote
RightDoubleGuillemet
OneQuarter
OneHalf
ThreeQuarters
InvertedQuestionMark

---
Grave
Acute
Circumflex
Tilde
Macron
Breve
OverDot
Diaeresis
---
OverRing
Cedilla
Underline
DoubleAcute
Ogonek
Hachek

HorizontalBar
SuperOne
Registered
Copyright
Trademark
MusicNote
---
---
---
---
---
---
OneEighth
ThreeEighths
FiveEighths
SevenEighths

Ohm
UppercaseDigraphAE
UppercaseStrokeD
OrdinalA
UppercaseStrokeH
LowercaseDotlessJ
UppercaseDigraphIJ
UppercaseMiddleDotL
UppercaseStrokeL
UppercaseSlashO
UppercaseDigraphOE
OrdinalO
UppercaseThorn
UppercaseStrokeT
UppercaseEngma
LowercaseApostropheN

LowercaseGreenlandicK
LowercaseDigraphAE
LowercaseStrokeD
LowercaseEth
LowercaseStrokeH
LowercaseDotlessI
LowercaseDigraphIJ
LowercaseDotL
LowercaseStrokeL
LowercaseSlashO
LowercaseDigraphOE
LowercaseDoubleS
LowercaseThorn
LowercaseStrokeT
LowercaseEngma

Bruce Sherwood
Center for Design of Educational Computing
   and Information Technology Center
Carnegie Mellon University