roberts@cognos.uucp (Robert Stanley) (07/30/87)
In article <232@Mannix.iros1.UUCP> fortin@iros1.UUCP (Denis Fortin) writes:
>Obviously, I am not suggesting that we make ASCII-bang-bang a
>pan-Canadian standard (eeeeeek!), but all the same, it's better
>with accents!

Of course it's better with accents. But it's ever so much easier to enter data without accents, even when you have a nice dead-key system such as that used on the Macintosh. And I, for one, find reading these ugly inserted special characters a real impediment. What is more, they drive some of my regular expressions and smart search utilities crazy. Their only merit seems to be for display (read: print) purposes, unless some enterprising net person were to write a neat filter to translate (in both directions) between an agreed inserted accent standard and some agreed termcap extension for terminals capable of displaying accented letters.

However, I think this brings us round to the need for a character set which includes all possible accented letter combinations as unique codes, or a standard for encoding accents in text strings.

An earlier posting suggested that 144 character codes would be required to enable a full complement of accented letters to be encoded as single characters. Unfortunately 144 into 128 doesn't go, and I have this strange feeling that 8-bit ASCII has already been grabbed by the graphics fraternity. Perhaps we should turn back to Big Blue and adopt EBCDIC as the universal (let us not be provincial in this matter) standard.

To be serious, this is a problem that is going to have to be resolved, much as the Japanese are coming to terms with encoding and displaying their character sets. For all languages that require addition/alteration to the standard Roman alphabet, left to right entered and unaugmented, we need an internal representation which takes collating sequences into account, an agreed display standard, and an acceptable data-entry mechanism.
Clearly there are a variety of techniques that can be employed to help translate inbound and outbound characters, ranging from phonetic keyboards (Boswell, where are you?) to reasoning mechanisms capable of deducing accents that need to be displayed. But at the heart we need not just a consensus, but a standard.

If there is this much interest in can.francais, perhaps a serious effort can be mounted, which might look (say) at all roman alphabet-based written languages, with a view to developing the step beyond ASCII. I wonder what happens to net traffic volumes when the basic character requires 14 bits! :-)

Never let us forget that computers are tools, and we the users. We should not have to compromise our working standards simply because the tools are inadequate. If accents are important, which they certainly are, then let us work towards tools that make it easy to employ them. After all, memory finally is cheap, data storage is getting cheap, and I suspect that high-speed data communications will be cheap in a year or three.

<< you better be prepared to make the most of the future, because that's where you're going to live the rest of your life>>
--
Robert Stanley           Compuserve: 76174,3024
Cognos Incorporated      uucp: decvax!utzoo!dciem!nrcaer!cognos!roberts
3755 Riverside Drive           or ...nrcaer!uottawa!robs
Ottawa, Ontario          Voice: (613) 738-1440 - Tuesdays only (don't ask)
CANADA  K1G 3N3
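
The two-way filter imagined in the post above, translating between an "inserted accent" notation and real accented letters, could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the notation (base letter followed by one of ` ' ^ " ,), the set of letters covered, and the names `to_accented` and `to_ascii`. No such convention was actually agreed in the thread.

```python
import re
import unicodedata

# Hypothetical "inserted accent" convention: a base letter followed by one
# ASCII mark. This is an illustration only, not an agreed standard.
MARK_FOR = {
    "\u0300": "`",   # combining grave      -> backquote
    "\u0301": "'",   # combining acute      -> apostrophe
    "\u0302": "^",   # combining circumflex -> caret
    "\u0308": '"',   # combining diaeresis  -> double quote
    "\u0327": ",",   # combining cedilla    -> comma
}

# Letters covered by this sketch (assumed list, lower and upper case).
ACCENTED = "àâäçèéêëîïôöùûü"
ACCENTED += ACCENTED.upper()

# Build both directions from the Unicode decomposition of each letter.
TO_ASCII = {}
for ch in ACCENTED:
    base, mark = unicodedata.normalize("NFD", ch)
    TO_ASCII[ch] = base + MARK_FOR[mark]
FROM_ASCII = {digraph: ch for ch, digraph in TO_ASCII.items()}

_PATTERN = re.compile("|".join(re.escape(d) for d in FROM_ASCII))


def to_accented(text):
    """Replace every known letter+mark pair with the accented letter."""
    return _PATTERN.sub(lambda m: FROM_ASCII[m.group(0)], text)


def to_ascii(text):
    """Replace every known accented letter with its letter+mark pair."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


print(to_accented("e'te'"))      # accented form of the input
print(to_ascii("français"))      # ASCII-only form of the input
```

The outbound direction (`to_ascii`) is unambiguous, but the inbound direction is not: ordinary text such as "avec, " or "e'" at the end of a quotation would also be rewritten. A real filter would need an escape mechanism, which is part of why the post asks for a standard rather than a convention.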
alastair@geovision.UUCP (Alastair Mayer) (07/30/87)
In article <1202@cognos.UUCP> roberts@cognos.UUCP (Robert Stanley) writes:
> [...] However, I
>think this brings us round to the need for a character set which includes all
>possible accented letter combinations as unique codes, or a standard for
>encoding accents in text strings.
>
> [..more stuff on various char codes, eg EBCDIC or 144-char sets omitted]
>

Uh, does anyone out there remember NAPLPS? (aka ANSI standard <whatever>, aka CSA standard <whatever>, aka Telidon) This code defines standard ways for transmitting any accented character needed in virtually all European languages. The sequence was essentially

    <special escape code><accent code><letter>

The escape code indicated a single-letter shift to the appropriate G set, in which the codes for accents (and cedillas, umlauts, etc) are defined as "non-spacing", ie printed without moving the cursor. Whatever letter follows then overstrikes the accent. (My NAPLPS ref isn't handy so I can't give the specific codes).

I once proposed this as a method for encoding accented letters in another electronic messaging system (CoSy) and some work was done toward modifying PC terminal programs to support this, but it sort of withered from lack of interest.

On a more concrete note, Hewlett-Packard seems dedicated to supporting an international character set, based on 16-bit characters. A lot (but not all) of the commands on our H-P 840 Unix system claim to support the 16-bit char set (I've never had the need or opportunity to try it). Japan is also going to a 16-bit character Unix.

Is the 8-bit character as obsolete as the 6-bit character? :-)
--
Alastair JW Mayer     BIX: al
                      UUCP: ...!utzoo!dciem!nrcaer!cognos!geovision!alastair
(Why do they call it a signature file if I can't actually *sign* anything?)
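
The escape/accent/letter sequence described in the post above can be illustrated with a small Python decoder. This is only a sketch of the idea: the post does not give the real NAPLPS codes, so every byte value below (`SINGLE_SHIFT` and the keys of `ACCENTS`) is a placeholder invented for the example, and the function name `decode` is hypothetical.

```python
import unicodedata

# All byte values here are placeholders chosen for this sketch. They are
# NOT taken from the NAPLPS standard, whose codes the post does not give.
SINGLE_SHIFT = 0x19          # hypothetical "shift for one character" code
ACCENTS = {                  # hypothetical accent codes -> combining marks
    0x41: "\u0300",          # grave
    0x42: "\u0301",          # acute
    0x43: "\u0302",          # circumflex
    0x48: "\u0308",          # diaeresis
    0x4B: "\u0327",          # cedilla
}


def decode(stream: bytes) -> str:
    """Turn <shift><accent><letter> triples into single accented letters."""
    out = []
    i = 0
    while i < len(stream):
        is_triple = (
            stream[i] == SINGLE_SHIFT
            and i + 2 < len(stream)
            and stream[i + 1] in ACCENTS
        )
        if is_triple:
            letter = chr(stream[i + 2])
            mark = ACCENTS[stream[i + 1]]
            # The accent is sent first (non-spacing) and the letter
            # overstrikes it; Unicode wants letter first, then the mark.
            out.append(unicodedata.normalize("NFC", letter + mark))
            i += 3
        else:
            out.append(chr(stream[i]))
            i += 1
    return "".join(out)


message = bytes([SINGLE_SHIFT, 0x42]) + b"et" + bytes([SINGLE_SHIFT, 0x42]) + b"e"
print(decode(message))
```

Note the cost of this scheme on a 7- or 8-bit link: every accented letter takes three bytes instead of one, which is the trade-off against a larger single-byte or 16-bit character set discussed in the rest of the thread.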
egisin@orchid.UUCP (08/02/87)
In article <1202@cognos.UUCP>, roberts@cognos.uucp (Robert Stanley) writes:
> An earlier posting suggested that 144 character codes would be required to
> enable a full complement of accented letters to be encoded as single
> characters. Unfortunately 144 into 128 doesn't go, and I have this strange
> feeling that 8-bit ASCII has already been grabbed by the graphics fraternity.
> Perhaps we should turn back to Big Blue and adopt EBCDIC as the universal (let
> us not be provincial in this matter) standard.

8-bit ASCII has been around for a few years now. I think it is defined by ANSI X3.64, which adds 32 control and 94 graphic characters to the 128 existing ASCII characters.

ISO Latin 1 is an extended ASCII character set with about 60 accented characters, suitable for most western languages. VT200 terminals and many new laser printers have this character set. I haven't seen much support for 8-bit ASCII in software other than DEC's software and the MKS Toolkit.
flaps@utcsri.UUCP (08/05/87)
In article <1202@cognos.UUCP> roberts@cognos.UUCP (Robert Stanley) writes:
>However, I
>think this brings us round to the need for a character set which includes all
>possible accented letter combinations as unique codes...
>If there is this much interest in can.francais, perhaps a serious effort can be
>mounted, which might look (say) at all roman alphabet-based written languages,

The ISO Latin 1 standard is based on this idea... it claims to work for Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish (as used in specific countries.. I won't bother with the country list as it is longer).

I think all of the characters relevant to French are the following. I might have missed a couple or mistakenly included a couple here, but they're all in the standard.

Dec  Pos    ASCII  7-bit  Name
224  14/00  a`     `      latin small letter a with grave accent
226  14/02  a^     b      latin small letter a with circumflex accent
228  14/04  a"     d      latin small letter a with diaeresis
231  14/07  c,     g      latin small letter c with cedilla
232  14/08  e`     h      latin small letter e with grave accent
233  14/09  e'     i      latin small letter e with acute accent
234  14/10  e^     j      latin small letter e with circumflex accent
235  14/11  e"     k      latin small letter e with diaeresis
238  14/14  i^     n      latin small letter i with circumflex accent
239  14/15  i"     o      latin small letter i with diaeresis
244  15/04  o^     t      latin small letter o with circumflex accent
246  15/06  o"     v      latin small letter o with diaeresis
249  15/09  u`     y      latin small letter u with grave accent
252  15/12  u"     |      latin small letter u with diaeresis

There doesn't seem to be an oe diphthong, but this isn't really necessary, especially these days.
--
      //  Alan J Rosenthal
     //
 \\ //        flaps@csri.toronto.edu, {seismo!utai or utzoo}!utcsri!flaps,
  \//              flaps@toronto on csnet, flaps at utorgpu on bitnet.

"To be whole is to be part; true voyage is return."
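
The columns of the table above can be recomputed with a few lines of Python, using the language's built-in `latin-1` codec as a stand-in for the standard. The helper names `describe` and `in_latin1` are made up for this sketch, and the letter list simply mirrors the table in the post.

```python
# Letters from the table in the post (the poster notes it may be incomplete).
FRENCH = "àâäçèéêëîïôöùü"


def describe(ch):
    """Return (decimal code, column/row position, 7-bit character)."""
    code = ch.encode("latin-1")[0]       # single byte in ISO Latin 1
    column, row = divmod(code, 16)       # upper and lower nibble
    stripped = chr(code & 0x7F)          # what a 7-bit channel would show
    return code, f"{column:02d}/{row:02d}", stripped


def in_latin1(ch):
    """True if the character has a code in ISO Latin 1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for ch in FRENCH:
    code, position, stripped = describe(ch)
    print(code, position, ch, stripped)

print("oe ligature available:", in_latin1("œ"))
```

The last column shows why 8-bit-clean transport matters: on a path that strips the high bit, each accented letter silently turns into an unrelated ASCII character.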
fortin@iros1.UUCP (08/14/87)
This (rather long) message is assembled from two messages that appeared recently in the comp.std.internat group. It describes an 8-bit code containing the accented characters used by several Western European countries. This code, known as ISO-Latin/1, is now an international standard. (Well, read the text below for more details!)

(PS. And no EBCDIC!)

Denis Fortin
fortin@zap.UUCP
fortin@iros1.UUCP

From: sommar@enea.UUCP (Erland Sommarskog) (in Stockholm)

* ISO-Latin/1 *

In the document, each byte value is written in the notation xx/yy, where xx is the upper nibble (four bits) and yy is the lower nibble, both in decimal. The lower part of the table, i.e. positions 02/00 to 07/14, is exactly the same as ASCII. The upper part of the table contains the characters we can't live without in large parts of the world. Since I do not know how to send pictures in a standardised way (are MacPaint documents OK?), I here include a table from ISO No.1: (Note: This was the draft.
See below for more info)

10/00 NO-BREAK SPACE
10/01 INVERTED EXCLAMATION MARK
10/02 CENT SIGN
10/03 POUND SIGN
10/04 CURRENCY SIGN
10/05 YEN SIGN
10/06 BROKEN BAR
10/07 PARAGRAPH SIGN, SECTION SIGN
10/08 DIAERESIS
10/09 COPYRIGHT SIGN
10/10 FEMININE ORDINAL INDICATOR
10/11 LEFT ANGLE QUOTATION MARK
10/12 NOT SIGN
10/13 SOFT HYPHEN
10/14 REGISTERED TRADE MARK SIGN
10/15 MACRON
11/00 DEGREE SIGN
11/01 PLUS-MINUS SIGN
11/02 SUPERSCRIPT TWO
11/03 SUPERSCRIPT THREE
11/04 ACUTE ACCENT
11/05 SMALL GREEK LETTER MU, MICRO SIGN
11/06 PILCROW SIGN
11/07 MIDDLE DOT
11/08 CEDILLA
11/09 SUPERSCRIPT ONE
11/10 MASCULINE ORDINAL INDICATOR
11/11 RIGHT ANGLE QUOTATION MARK
11/12 VULGAR FRACTION ONE QUARTER
11/13 VULGAR FRACTION ONE HALF
11/14 VULGAR FRACTION THREE QUARTERS
11/15 INVERTED QUESTION MARK
12/00 CAPITAL LETTER A WITH GRAVE ACCENT
12/01 CAPITAL LETTER A WITH ACUTE ACCENT
12/02 CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
12/03 CAPITAL LETTER A WITH TILDE
12/04 CAPITAL LETTER A WITH DIAERESIS
12/05 CAPITAL LETTER A WITH RING ABOVE
12/06 CAPITAL DIPHTHONG A WITH E
12/07 CAPITAL LETTER C WITH CEDILLA
12/08 CAPITAL LETTER E WITH GRAVE ACCENT
12/09 CAPITAL LETTER E WITH ACUTE ACCENT
12/10 CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
12/11 CAPITAL LETTER E WITH DIAERESIS
12/12 CAPITAL LETTER I WITH GRAVE ACCENT
12/13 CAPITAL LETTER I WITH ACUTE ACCENT
12/14 CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
12/15 CAPITAL LETTER I WITH DIAERESIS
13/00 CAPITAL ICELANDIC LETTER ETH
13/01 CAPITAL LETTER N WITH TILDE
13/02 CAPITAL LETTER O WITH GRAVE ACCENT
13/03 CAPITAL LETTER O WITH ACUTE ACCENT
13/04 CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
13/05 CAPITAL LETTER O WITH TILDE
13/06 CAPITAL LETTER O WITH DIAERESIS
13/07 (This position shall not be used)
13/08 CAPITAL LETTER O WITH OBLIQUE STROKE
13/09 CAPITAL LETTER U WITH GRAVE ACCENT
13/10 CAPITAL LETTER U WITH ACUTE ACCENT
13/11 CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
13/12 CAPITAL LETTER U WITH DIAERESIS
13/13 CAPITAL LETTER Y WITH ACUTE ACCENT
13/14 CAPITAL ICELANDIC LETTER THORN
13/15 SMALL GERMAN LETTER SHARP s
14/00 SMALL LETTER a WITH GRAVE ACCENT
14/01 SMALL LETTER a WITH ACUTE ACCENT
14/02 SMALL LETTER a WITH CIRCUMFLEX ACCENT
14/03 SMALL LETTER a WITH TILDE
14/04 SMALL LETTER a WITH DIAERESIS
14/05 SMALL LETTER a WITH RING ABOVE
14/06 SMALL DIPHTHONG a WITH e
14/07 SMALL LETTER c WITH CEDILLA
14/08 SMALL LETTER e WITH GRAVE ACCENT
14/09 SMALL LETTER e WITH ACUTE ACCENT
14/10 SMALL LETTER e WITH CIRCUMFLEX ACCENT
14/11 SMALL LETTER e WITH DIAERESIS
14/12 SMALL LETTER i WITH GRAVE ACCENT
14/13 SMALL LETTER i WITH ACUTE ACCENT
14/14 SMALL LETTER i WITH CIRCUMFLEX ACCENT
14/15 SMALL LETTER i WITH DIAERESIS
15/00 SMALL ICELANDIC LETTER ETH
15/01 SMALL LETTER n WITH TILDE
15/02 SMALL LETTER o WITH GRAVE ACCENT
15/03 SMALL LETTER o WITH ACUTE ACCENT
15/04 SMALL LETTER o WITH CIRCUMFLEX ACCENT
15/05 SMALL LETTER o WITH TILDE
15/06 SMALL LETTER o WITH DIAERESIS
15/07 (This position shall not be used)
15/08 SMALL LETTER o WITH OBLIQUE STROKE
15/09 SMALL LETTER u WITH GRAVE ACCENT
15/10 SMALL LETTER u WITH ACUTE ACCENT
15/11 SMALL LETTER u WITH CIRCUMFLEX ACCENT
15/12 SMALL LETTER u WITH DIAERESIS
15/13 SMALL LETTER y WITH ACUTE ACCENT
15/14 SMALL ICELANDIC LETTER THORN
15/15 SMALL LETTER y WITH DIAERESIS

End of table
--------------------------------

Note from lasko@video.dec.com (Tim Lasko) about ISO-Latin/1:

ISO Latin-1, or more completely ISO Latin Alphabet No 1, is now an international standard as of February 1987 (IS 8859, Part 1). For those American USENET'ers that care, the 8-bit ASCII standard, which is essentially the same code, is going through the final administrative processes prior to publication.

The code table that was posted earlier by Mr.
Sommarskog to the net is from an earlier draft of the standard; the following changes have been made:

OLD DRAFT:
13/07 (This position shall not be used)
15/07 (This position shall not be used)

FINAL STANDARD:
13/07 MULTIPLICATION SIGN
15/07 DIVISION SIGN

Those two characters were added mainly out of the fear that individual vendors would use the positions for non-interchangeable and incompatible purposes, thus defeating the idea of the standard. The two symbols chosen were more or less a compromise from a large list of eligible characters.

ISO Latin-1 (IS 8859/1) is actually one of an entire family of eight-bit one-byte character sets, all having ASCII on the left hand side, and with varying repertoires on the right hand side:

Pt 1. Latin Alphabet No 1 (caters to Western Europe - now approved)
Pt 2. Latin Alphabet No 2 (caters to Eastern Europe - now approved)
Pt 3. Latin Alphabet No 3 (caters to SE Europe + others - in draft ballot)
Pt 4. Latin Alphabet No 4 (caters to Northern Europe - in draft ballot)
Pt 5. Latin-Cyrillic alphabet (right half all Cyrillic - processing currently suspended pending USSR input)
Pt 6. Latin-Arabic alphabet (right half all Arabic - now approved)
Pt 7. Latin-Greek alphabet (right half Greek + symbols - in draft ballot)
Pt 8. Latin-Hebrew alphabet (right half Hebrew + symbols - proposed)

I expect to update this list shortly, because next week I'm attending the meeting of the ISO Working Group concerned with these standards. (ISO TC97/SC2/WG3 for those who can decipher that.)
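
The xx/yy position notation and the "ASCII on the left, varying repertoire on the right" design of the family can both be checked with a short Python sketch. It relies on Python's built-in `iso8859_1` through `iso8859_8` codecs, which follow the published editions of the standard and so may differ in detail from the 1987 drafts and ballots listed above. The function names are invented for this example.

```python
import unicodedata


def position_to_byte(position: str) -> int:
    """Convert column/row notation such as '14/09' to a byte value."""
    column, row = (int(part) for part in position.split("/"))
    return column * 16 + row


def name_in_latin1(position: str) -> str:
    """Unicode name of the character at this position in ISO Latin 1."""
    ch = bytes([position_to_byte(position)]).decode("iso8859_1")
    return unicodedata.name(ch)


def left_half_is_ascii(part: int) -> bool:
    """Check that positions 02/00 to 07/14 of a given part match ASCII."""
    printable = bytes(range(position_to_byte("02/00"),
                            position_to_byte("07/14") + 1))
    return printable.decode(f"iso8859_{part}") == printable.decode("ascii")


# The two positions that changed between the draft and the final standard.
print(name_in_latin1("13/07"))
print(name_in_latin1("15/07"))

# The same right-half byte means something different in each part.
sample = bytes([position_to_byte("14/09")])
for part in range(1, 9):
    glyph = sample.decode(f"iso8859_{part}", errors="replace")
    print(part, left_half_is_ascii(part), glyph)
```

Because one byte value maps to different letters in different parts, text in this family is only interchangeable when both ends agree out of band on which part is in use.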