bas+@andrew.cmu.edu (Bruce Sherwood) (08/27/87)
I'm distressed by the nature of the new ISO Latin scheme (ISO 8859-1). There already appeared some time ago ISO 6937 which covers nearly ALL languages which use Roman-letter alphabets (with the exception of Vietnamese), whereas the new ISO 8859 covers only some languages. ISO 8859-1 seems a very major step backwards. The processing of non-English text in computer systems has been plagued by one half-solution after another. Just when things were looking up (with ISO 6937), along comes a new and different standard which is much more limited in scope.

ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96 characters. About 30 of these are special characters not formable from diacritics (e.g., Icelandic thorn, or undotted i). There is a full set of diacritics, which precede the letter they modify. You can think of them as non-spacing characters (so that the following letter prints on top of the diacritic). A better way to think of them however is as "alert" codes, specifying that it and the following code form a 16-bit specification for a character. The actual dot pattern may be formed by superposition, or it may be stored in a separate "rendering" set (to make a better-looking character than could be produced by superimposing a letter and a separate diacritic). The rest of the 96 extra characters are punctuation (such as inverted exclamation and question for Spanish), some math symbols, etc. In fact, the first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit characters of ISO 6937.

There is something exceedingly strange about ISO 8859-1. Appendix A lists countries rather than languages for which the standard is valid. This is awfully peculiar. For example, Spain is in the list. But Catalan is a very important language in Spain, and in fact it is the language of the technologically most developed part of the country (the region containing Barcelona). And it appears that ISO 8859-1 does not handle Catalan (dotted L)!
And I note that the ligatured ij of Dutch is missing. And the "apostrophe-n" of Afrikaans. And neither 8859-1 nor 8859-2 can handle Esperanto (a language which I use a lot). The ISO 6937 scheme handles all of these languages.

Here is a quote from a discussion of ISO 8859 (Tim Lasko, lasko@video.dec.com, DEC, writing in comp.std.internat):

"We (the U.S., ASC X3L2) realized a bit too late that certain characters needed to properly represent the Welsh language (w and y with circumflex) weren't conveniently available in any of the ISO 8859 sets, and tried to change Part 4 to include them. However, there was neither room nor consensus within the ISO committee to include these, so these too do not exist in any of the ISO 8859 code tables. (Arguably, the BSI should have been looking out for the requirements of Welsh, but for a number of reasons that I choose not to go into here, they did not.)"

This case of Welsh is another sad example of ISO 8859 catering to countries rather than to languages... And even in the face of the excellent work of ISO 6937, which contains a listing of the diacritic needs for 41 languages, including Welsh, which is listed as needing w and y with circumflex. I can't understand why the people working on 8859 didn't check their work against the comprehensive list given in 6937.

The 41 languages covered by 6937 are Afrikaans, Albanian, Basque, Breton, Catalan, Croat, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Hungarian, Icelandic, Irish, Italian, Lapp, Latvian, Lithuanian, Maltese, Norwegian, Occitan, Polish, Portuguese, Rhaeto-Romanic, Romanian, Scots Gaelic, Slovak, Slovene, Sorbian, Spanish, Swedish, Turkish, and Welsh.
It seems most unfortunate in this day of laser printers and fancy displays and sophisticated window managers to implement yet another half solution, one which is only sort of valid for some region of the globe, and even there is valid only for "national" rather than regional languages.

The extensive multi-lingual Xerox scheme contains 6937 as one of the basic sets. The AT&T Videotex scheme is based on 6937. The basic coding scheme in PostScript is a subset of 6937 (it contains all of the 6937 diacritics, and some of the 6937 special characters such as AE, in the same slots as 6937, but it leaves many slots unused). It may be that suddenly 6937 is out of favor because it "didn't fully catch on," but it seems tragic to back off from a full solution.

Perhaps you would be interested in what we plan to do in Base Environment 2 (BE2) of the Andrew system under development at the Information Technology Center at Carnegie Mellon. Much of the design is due to Tomas Centerlind of Sweden, who worked here this summer. Since we don't do Unix operating-system development here, we feel that for now we have to stay with a 7-bit external representation (on disk, in mail, etc.). In the text datastream AE will be represented by \.DigraphAE{}, and the Spanish n-tilde will be represented by \.Tilde{n}.

In memory the AE in a BE2 document will be the ISO 6937 8-bit code for AE. The n-tilde will be represented in the document by the code 255, indicating that one must look in the accompanying environment tree (used also for representing styles such as italic) for a 32-bit character code. This "longchar" has the form 8/0, 8/0, 8/tilde, 8/n. The upper bytes are for expansion and indicate what character sets the lower two bytes refer to, and the lower bytes are ISO 6937 for the diacritic and letter.
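To make the longchar layout described above concrete, here is a minimal Python sketch. It is not BE2 code: the function names and the dict standing in for the environment tree are hypothetical. Only the byte order (two character-set bytes, then diacritic, then letter) and the placeholder code 255 come from the post, and the tilde code 0xC4 assumes that Tilde sits at slot 196 in the ISO 6937 slot list given later in this thread.

```python
# Hypothetical sketch of the 32-bit "longchar": 8/0, 8/0, 8/tilde, 8/n.
LONGCHAR_PLACEHOLDER = 255  # in-text code meaning "look in the environment tree"
TILDE = 0xC4                # assumption: slot 196 (Tilde) in the ISO 6937 list


def pack_longchar(diacritic, letter, set_hi=0, set_lo=0):
    """Pack four 8-bit fields into one 32-bit integer, highest byte first."""
    for field in (set_hi, set_lo, diacritic, letter):
        if not 0 <= field <= 255:
            raise ValueError("each field must fit in 8 bits")
    return (set_hi << 24) | (set_lo << 16) | (diacritic << 8) | letter


def unpack_longchar(value):
    """Split a 32-bit longchar back into (set_hi, set_lo, diacritic, letter)."""
    return (
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 8) & 0xFF,
        value & 0xFF,
    )


# A three-character text (a, n-tilde, o): the n-tilde is kept out of line.
codes = [ord("a"), LONGCHAR_PLACEHOLDER, ord("o")]
# Stand-in for the environment tree: text position -> longchar.
longchars = {1: pack_longchar(TILDE, ord("n"))}

print(hex(longchars[1]))              # 0xc46e
print(unpack_longchar(longchars[1]))  # (0, 0, 196, 110)
```

Because every position in `codes` holds exactly one code, the text still has one entry per character, which is the property the out-of-line scheme is meant to preserve.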
The reason for putting the tilde-n out of line is to simplify various aspects of BE2 text manipulation, and to make multi-byte characters nevertheless be accessed by the programmer as single entities.

While editing, you can choose a system- or user-defined keyboard, with associated key bindings. You can have the keyboard displayed at the bottom of the editing window and type with the mouse if you want. Much of the keyboard redefinition machinery has been built, but there are pieces of BE2 which have not yet been tweaked to make it all work.

Bruce Sherwood
Center for Design of Educational Computing and
Information Technology Center
Carnegie Mellon University
hrs@homxb.UUCP (H.SILBIGER) (09/01/87)
> ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96
> characters. About 30 of these are special characters not formable from
> diacritics (e.g., Icelandic thorn, or undotted i). There is a full set of
> diacritics, which precede the letter they modify. You can think of them as
> non-spacing characters (so that the following letter prints on top of the
> diacritic). A better way to think of them however is as "alert" codes,
> specifying that it and the following code form a 16-bit specification for a
> character. The actual dot pattern may be formed by superposition, or it may
> be stored in a separate "rendering" set (to make a better-looking character
> than could be produced by superimposing a letter and a separate diacritic).
> The rest of the 96 extra characters are punctuation (such as inverted
> exclamation and question for Spanish), some math symbols, etc. In fact, the
> first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit
> characters of ISO 6937.
>

ISO 6937 is emerging as the standard code set for communication. ISO and CCITT standards on Document Communication all specify this set. The CCITT equivalent is Recommendation T.61.

Herman Silbiger
...!ihnp4!homxb!hrs
frisk@askja.UUCP (Fridrik Skulason) (09/05/87)
On the subject of ISO 6937 versus ISO 8859.

6937 may be better than 8859 for some purposes (communication that is), but as a standard character set for terminals it is useless.

The reason ... Simple. Most existing software packages assume that (1 char in text = 1 char on screen). Adapting the software to work with a full 256 character set instead of ASCII may be difficult, but it's still just a minor problem compared to making 6937 work as a standard computer/terminal character set.

Here in Iceland ISO 8859 is heavily used. It is the only useful standard that supports all characters in our alphabet. (ISO 8859/1 - Western Europe that is - the last time I looked ISO 8859/2 (or was that /3) - for Northern Europe did not!) It is available on some personal computers here (Amiga/Atari), on some terminals (ADM in particular), and a couple of printers.

In fact - just a few days ago we decided that the University would not buy or support a single piece of equipment that did not support ISO 8859/1. (Meaning - in my case - that I have to work overtime to modify the $#?!$#?!$!!! terminal emulation on the Macs. (I have already fixed the PCs))

So - in our case at least - 8859/1 is here to stay..

--
Fridrik Skulason    Univ. of Iceland, Computing Center
UUCP frisk@rhi      BIX frisk

"This line intentionally left blank"
bas+@andrew.cmu.edu (Bruce Sherwood) (09/11/87)
A couple people wrote personally to me asking me to send them the ISO 6937 standard, or how to order it. ISO 6937/2-1983 (E) can be ordered from

American National Standards Institute
Department SD
1430 Broadway
New York NY 10018

I can't seem to find the cost, but I think it is about $35 for the paper including shipping and handling ("Price based on 37 pages", according to the cover sheet). Expensive -- but I may be wrong about the exact price.

Here is the gist of ISO 6937. It contains standard old ASCII in the slots 32 thru 126. In the upper (8-bit) slots from 161 thru 254 we have the list shown at the end of this note, divided into groups of 16 slots, with "---" indicating "not assigned". The key features are a full set of diacritic codes and a full set of letters used by roman-letter alphabets which aren't in base ASCII and can't be made with diacritics. Together these enable handling 41 different languages, probably constituting almost all roman-letter scripts other than Vietnamese.

The codes in the column of diacritics function as escape codes, indicating that it plus the following code constitute a 16-bit specification of a complex character. That complex character may be rendered by superimposing a letter and a diacritic, or some implementations may choose to have a separate "rendering" set of images in which the diacritic is already on the letter (this gives higher-quality print possibilities, of course).

Note that although the diacritic code precedes the associated letter code, a decent computer system should allow the user to type a diacritic key AFTER the letter. Having to type it BEFORE is a bad holdover from mechanical typewriters, which could handle diacritics only by implementing a "dead key" which didn't advance the platen. Linguistically however it makes no sense to type the diacritic before typing the letter, and it should be the job of the input routine to turn the bytes around in memory.
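The "turn the bytes around" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any real system: `store_keystrokes` is a hypothetical name, and the diacritic codes are assumed to occupy slots 193 thru 207 (with 201 unassigned), as read off the slot list in this note.

```python
# Assumption: diacritic codes are slots 193-207 of the list, minus the
# unassigned slot 201. These numbers are inferred from the slot list.
DIACRITICS = set(range(0xC1, 0xD0)) - {0xC9}


def store_keystrokes(typed):
    """Take codes in typing order (letter first, diacritic after) and
    return them in storage order (diacritic first, letter after).

    Simplification: any non-diacritic code counts as a "letter", and a
    letter takes at most one diacritic.
    """
    out = bytearray()
    for code in typed:
        prev_is_plain_letter = (
            len(out) >= 1
            and out[-1] not in DIACRITICS
            and (len(out) < 2 or out[-2] not in DIACRITICS)
        )
        if code in DIACRITICS and prev_is_plain_letter:
            # Slip the diacritic in front of the letter just typed.
            out.insert(len(out) - 1, code)
        else:
            out.append(code)
    return bytes(out)


TILDE = 0xC4  # assumed slot 196
typed = list(b"man") + [TILDE] + list(b"ana")  # user types n, then tilde
print(store_keystrokes(typed))  # b'ma\xc4nana'
```

A diacritic typed with no letter before it is simply stored as-is here; a real input routine would have to decide what to do with it.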
Slots 161-175:
    InvertedExclamationPoint Cent Pound Dollar Yen --- Section ---
    LeftSingleQuote LeftDoubleQuote LeftDoubleGuillemet LeftArrow
    UpArrow RightArrow DownArrow

Slots 176-191:
    Degree Plus/Minus SuperTwo SuperThree Multiply Micro Paragraph
    CenteredDot Divide RightSingleQuote RightDoubleQuote
    RightDoubleGuillemet OneQuarter OneHalf ThreeQuarters
    InvertedQuestionMark

Slots 192-207:
    --- Grave Acute Circumflex Tilde Macron Breve OverDot Diaeresis ---
    OverRing Cedilla Underline DoubleAcute Ogonek Hachek

Slots 208-223:
    HorizontalBar SuperOne Registered Copyright Trademark MusicNote
    --- --- --- --- --- ---
    OneEighth ThreeEighths FiveEighths SevenEighths

Slots 224-239:
    Ohm UppercaseDigraphAE UppercaseStrokeD OrdinalA UppercaseStrokeH
    LowercaseDotlessJ UppercaseDigraphIJ UppercaseMiddleDotL
    UppercaseStrokeL UppercaseSlashO UppercaseDigraphOE OrdinalO
    UppercaseThorn UppercaseStrokeT UppercaseEngma LowercaseApostropheN

Slots 240-254:
    LowercaseGreenlandicK LowercaseDigraphAE LowercaseStrokeD
    LowercaseEth LowercaseStrokeH LowercaseDotlessI LowercaseDigraphIJ
    LowercaseDotL LowercaseStrokeL LowercaseSlashO LowercaseDigraphOE
    LowercaseDoubleS LowercaseThorn LowercaseStrokeT LowercaseEngma

Bruce Sherwood
Center for Design of Educational Computing and
Information Technology Center
Carnegie Mellon University
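For readers who want to experiment with the diacritic-before-letter scheme, the following Python sketch decodes such a byte stream into composed characters. It is a partial illustration, not a conforming ISO 6937 decoder: the function name is hypothetical, the slot numbers are assumed from the list above (Grave at 193 thru Hachek at 207), the pairing with Unicode combining marks is this sketch's own choice, and the non-diacritic special characters in the upper slots are not handled.

```python
import unicodedata

# Assumed slot -> Unicode combining mark, following the order of the list.
COMBINING = {
    0xC1: "\u0300",  # Grave
    0xC2: "\u0301",  # Acute
    0xC3: "\u0302",  # Circumflex
    0xC4: "\u0303",  # Tilde
    0xC5: "\u0304",  # Macron
    0xC6: "\u0306",  # Breve
    0xC7: "\u0307",  # OverDot
    0xC8: "\u0308",  # Diaeresis
    0xCA: "\u030A",  # OverRing
    0xCB: "\u0327",  # Cedilla
    0xCC: "\u0332",  # Underline
    0xCD: "\u030B",  # DoubleAcute
    0xCE: "\u0328",  # Ogonek
    0xCF: "\u030C",  # Hachek
}


def decode_diacritic_pairs(data):
    """Decode ASCII plus (diacritic, letter) pairs; anything else -> U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        code = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if code in COMBINING and nxt is not None and nxt < 0x80:
            # Two bytes form one character: compose letter + mark if possible.
            out.append(unicodedata.normalize("NFC", chr(nxt) + COMBINING[code]))
            i += 2
        elif code < 0x80:
            out.append(chr(code))
            i += 1
        else:
            out.append("\ufffd")  # special characters are not handled here
            i += 1
    return "".join(out)


print(decode_diacritic_pairs(b"ma\xc4nana"))  # mañana
print(decode_diacritic_pairs(b"\xc3w"))       # ŵ
```

The second call shows the Welsh w with circumflex mentioned earlier in the thread coming out as a single composed character from a two-byte pair.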