[can.general] Accents and objectives

roberts@cognos.uucp (Robert Stanley) (07/30/87)

In article <232@Mannix.iros1.UUCP> fortin@iros1.UUCP (Denis Fortin) writes:

>Obviously, I am not suggesting that we make ASCII-bang-bang a
>pan-Canadian standard (eeeeeek!) but all the same, it's better
>with accents!

Of course it's better with accents.  But it's ever so much easier to enter
data without accents, even when you have a nice dead-key system such as that
used on the Macintosh.  And I, for one, find reading these ugly inserted
special characters a real impediment.  What is more, they drive some of my
regular expressions and smart search utilities crazy.

Their only merit seems to be for display (read: print) purposes, unless some
enterprising net person were to write a neat filter to translate (in both
directions) between an agreed inserted accent standard and some agreed term-cap
extension for terminals capable of displaying accented letters.  However, I
think this brings us round to the need for a character set which includes all
possible accented letter combinations as unique codes, or a standard for
encoding accents in text strings.

An earlier posting suggested that 144 character codes would be required to
enable a full complement of accented letters to be encoded as single
characters.  Unfortunately 144 into 128 doesn't go, and I have this strange
feeling that 8-bit ASCII has already been grabbed by the graphics fraternity.
Perhaps we should turn back to Big Blue and adopt EBCDIC as the universal (let
us not be provincial in this matter) standard.

To be serious, this is a problem that is going to have to be resolved, much as
the Japanese are coming to terms with encoding and displaying their character
sets.  For all languages that require addition/alteration to the standard Roman
alphabet, left to right entered and unaugmented, we need an internal
representation which takes collating sequences into account, an agreed display
standard, and an acceptable data-entry mechanism.  Clearly there are a variety
of techniques that can be employed to help translate inbound and outbound
characters, ranging from phonetic keyboards (Boswell, where are you?) to
reasoning mechanisms capable of deducing accents that need to be displayed.
But at the heart we need not just a consensus, but a standard.
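
To make the collating point concrete, here is a minimal sketch in Python.
It assumes (my assumption, not an agreed rule) that accents should only
break ties between otherwise identical words; real French collation is
more involved.  The name collation_key is made up for illustration.

```python
import unicodedata

def collation_key(word: str):
    """Sort key: compare unaccented letters first, accents only as a tie-break."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.lower(), word)

words = ["zebre", "été", "eta", "cote", "côte"]
print(sorted(words))                      # raw code order: accented words drift to the end
print(sorted(words, key=collation_key))   # accent-aware order
```

Sorting on raw character codes puts "été" after "zebre", which is exactly
the kind of surprise an internal representation has to plan for.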

If there is this much interest in can.francais, perhaps a serious effort can be
mounted, which might look (say) at all roman alphabet-based written languages,
with a view to developing the step beyond ASCII.  I wonder what happens to net
traffic volumes when the basic character requires 14 bits! :-)

Never let us forget that computers are tools, and we the users.  We should not
have to compromise our working standards simply because the tools are
inadequate.  If accents are important, which they certainly are, then let us
work towards tools that make it easy to employ them.  After all, memory
finally is cheap, data storage is getting cheap, and I suspect that high-speed
data communications will be cheap in a year or three.

<< you better be prepared to make the most of the future,
   because that's where you're going to live the rest of your life>>

-- 
Robert Stanley           Compuserve: 76174,3024        Cognos Incorporated
 uucp: decvax!utzoo!dciem!nrcaer!cognos!roberts        3755 Riverside Drive 
                   or  ...nrcaer!uottawa!robs          Ottawa, Ontario
Voice: (613) 738-1440 - Tuesdays only (don't ask)      CANADA  K1G 3N3

alastair@geovision.UUCP (Alastair Mayer) (07/30/87)

In article <1202@cognos.UUCP> roberts@cognos.UUCP (Robert Stanley) writes:
> [...]  However, I
>think this brings us round to the need for a character set which includes all
>possible accented letter combinations as unique codes, or a standard for
>encoding accents in text strings.
>
>  [..more stuff on various char codes, eg EBCDIC or 144-char sets omitted]
>

Uh, does anyone out there remember NAPLPS? (aka ANSI standard <whatever>,
aka CSA standard <whatever>, aka Telidon)  This code defines standard
ways for transmitting any accented character needed in virtually all
European languages.  The sequence was essentially 
   <special escape code><accent code><letter>
The escape-code indicated a single-letter shift to the appropriate G set,
in which the codes for accents (and cedillas, umlauts, etc) are defined
as "non-spacing", ie printed without moving the cursor.  Whatever letter
follows then overstrikes the accent. (My NAPLPS ref isn't handy so I
can't give the specific codes).
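
A rough sketch of the non-spacing idea in Python.  The shift byte and the
accent codes below are placeholders I picked for illustration, not the
real NAPLPS values, and Unicode combining marks stand in for the
overstrike on a terminal.

```python
import unicodedata

# Placeholder codes, NOT taken from the NAPLPS documents.
SHIFT = "\x19"
ACCENTS = {
    "A": "\u0300",  # grave
    "B": "\u0301",  # acute
    "C": "\u0302",  # circumflex
    "H": "\u0308",  # diaeresis
    "K": "\u0327",  # cedilla
}

def decode(stream: str) -> str:
    """Turn <shift><accent><letter> triples into single accented letters."""
    out = []
    i = 0
    while i < len(stream):
        if stream[i] == SHIFT and i + 2 < len(stream) and stream[i + 1] in ACCENTS:
            accent = ACCENTS[stream[i + 1]]
            letter = stream[i + 2]
            # Unicode wants the mark after the base letter; NFC then
            # composes the pair into one code point where one exists.
            out.append(unicodedata.normalize("NFC", letter + accent))
            i += 3
        else:
            out.append(stream[i])
            i += 1
    return "".join(out)

print(decode("caf" + SHIFT + "Be"))
```

Anything that is not a complete triple passes through untouched, so plain
ASCII text is unaffected.
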
   I once proposed this as a method for encoding accented letters in
another electronic messaging system (CoSy) and some work was done toward
modifying PC terminal programs to support this, but it sort of withered
from lack of interest.

   On a more concrete note, Hewlett-Packard seems dedicated to supporting
an international character set, based on 16-bit characters.  A lot (but
not all) of the commands on our H-P 840 Unix system claim to support
the 16-bit char set (I've never had the need or opportunity to try it).
Japan is also going to a 16-bit character Unix.
    Is the 8-bit character as obsolete as the 6-bit character? :-)
-- 
 Alastair JW Mayer     BIX: al
                      UUCP: ...!utzoo!dciem!nrcaer!cognos!geovision!alastair

(Why do they call it a signature file if I can't actually *sign* anything?)

egisin@orchid.UUCP (08/02/87)

In article <1202@cognos.UUCP>, roberts@cognos.uucp (Robert Stanley) writes:
> An earlier posting suggested that 144 character codes would be required to
> enable a full complement of accented letters to be encoded as single
> characters.  Unfortunately 144 into 128 doesn't go, and I have this strange
> feeling that 8-bit ASCII has already been grabbed by the graphics fraternity.
> Perhaps we should turn back to Big Blue and adopt EBCDIC as the universal (let
> us not be provincial in this matter) standard.

8 bit ascii has been around for a few years now.
I think it is defined by Ansi X3.64, which adds 32 control and 94 graphic
characters to the 128 existing ascii characters.

ISO Latin 1 is an extended ascii character set with about 60 accented
characters, suitable for most western languages. VT200 terminals
and many new laser printers have this character set.
I haven't seen much support for 8 bit ascii in software
other than DEC's software and the MKS Toolkit.

flaps@utcsri.UUCP (08/05/87)

In article <1202@cognos.UUCP> roberts@cognos.UUCP (Robert Stanley) writes:
>However, I
>think this brings us round to the need for a character set which includes all
>possible accented letter combinations as unique codes...

>If there is this much interest in can.francais, perhaps a serious effort can be
>mounted, which might look (say) at all roman alphabet-based written languages,

The ISO Latin 1 standard is based on this idea... it claims to work for
Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic,
Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish (as used in
specific countries.. I won't bother with the country list as it is longer).

I think all of the characters relevant to French are the following.  I
might have missed a couple or mistakenly included a couple here, but
they're all in the standard.

224  14/00  a`  à  latin small letter a with grave accent
226  14/02  a^  â  latin small letter a with circumflex accent
228  14/04  a"  ä  latin small letter a with diaeresis
231  14/07  c,  ç  latin small letter c with cedilla
232  14/08  e`  è  latin small letter e with grave accent
233  14/09  e'  é  latin small letter e with acute accent
234  14/10  e^  ê  latin small letter e with circumflex accent
235  14/11  e"  ë  latin small letter e with diaeresis
238  14/14  i^  î  latin small letter i with circumflex accent
239  14/15  i"  ï  latin small letter i with diaeresis
244  15/04  o^  ô  latin small letter o with circumflex accent
246  15/06  o"  ö  latin small letter o with diaeresis
249  15/09  u`  ù  latin small letter u with grave accent
252  15/12  u"  ü  latin small letter u with diaeresis

There doesn't seem to be an oe diphthong, but this isn't really necessary,
especially these days.
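
For what it's worth, the table can drive a crude two-way filter.  This is
only a sketch in Python: the letter-plus-mark pairs and decimal codes are
copied from the table, while the function names to_latin1 and to_ascii
are my own invention.

```python
# Pairs and decimal codes copied from the table above.
TABLE = {
    "a`": 224, "a^": 226, 'a"': 228, "c,": 231,
    "e`": 232, "e'": 233, "e^": 234, 'e"': 235,
    "i^": 238, 'i"': 239, "o^": 244, 'o"': 246,
    "u`": 249, 'u"': 252,
}
TO_BYTE = {pair.encode("ascii"): bytes([code]) for pair, code in TABLE.items()}
FROM_BYTE = {byte: pair for pair, byte in TO_BYTE.items()}

def to_latin1(data: bytes) -> bytes:
    """Replace each two-character pair with its single 8-bit code."""
    out = bytearray()
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in TO_BYTE:
            out += TO_BYTE[pair]
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def to_ascii(data: bytes) -> bytes:
    """Expand each 8-bit code from the table back into its pair."""
    return b"".join(FROM_BYTE.get(bytes([b]), bytes([b])) for b in data)

print(to_latin1(b"c'est forme' a` Montre'al").decode("latin-1"))
```

Note the weakness: an innocent "avec, " contains "c," and would come out
with a cedilla, so a real filter needs an unambiguous escape convention.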

-- 

      //  Alan J Rosenthal
     //
 \\ //        flaps@csri.toronto.edu, {seismo!utai or utzoo}!utcsri!flaps,
  \//              flaps@toronto on csnet, flaps at utorgpu on bitnet.


"To be whole is to be part; true voyage is return."

fortin@iros1.UUCP (08/14/87)

	This (somewhat long) message is assembled from two messages that
appeared recently in the comp.std.internat group.  It describes an
8-bit code containing the accented characters used by several
Western European countries.  This code, known as ISO-Latin/1, is now
an international standard.  (Well, read the text that follows for
more details!)  (PS. And no EBCDIC!)


						Denis Fortin
						fortin@zap.UUCP
						fortin@iros1.UUCP


From: sommar@enea.UUCP (Erland Sommarskog)  (in Stockholm)

				* ISO-Latin/1 *

	In the document, each byte value is written in the notation xx/yy,
where xx is the upper nibble (four bits) and yy is the lower nibble, both
given in decimal.  The lower part of the table, i.e. positions 02/00 to
07/14, is exactly the same as ASCII.
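
	For readers who want to check positions by machine, a small sketch
in Python of the xx/yy arithmetic (the function names are made up for this
example):

```python
def position_to_byte(pos: str) -> int:
    """Convert 'xx/yy' (both parts decimal, each 0..15) to a byte value."""
    high, low = (int(part) for part in pos.split("/"))
    if not (0 <= high <= 15 and 0 <= low <= 15):
        raise ValueError(f"not a valid position: {pos!r}")
    # The upper nibble counts sixteens, the lower nibble counts units.
    return high * 16 + low

def byte_to_position(value: int) -> str:
    """Convert a byte value 0..255 back to the 'xx/yy' notation."""
    return f"{value // 16:02d}/{value % 16:02d}"

print(position_to_byte("14/09"))   # 14 * 16 + 9
print(byte_to_position(233))
```

So 14/09 is byte value 233, and the ASCII range 02/00 to 07/14 is 32 to 126.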

	The upper part of the table contains the characters we can't live 
without in large parts of the world.  Since I do not know how to send pictures
in a standardised way (are MacPaint documents OK?), I here include a table
from ISO No.1:

	(Note: This was the draft.  See below for more info)

10/00   NO-BREAK SPACE
10/01   INVERTED EXCLAMATION MARK
10/02   CENT SIGN
10/03   POUND SIGN
10/04   CURRENCY SIGN
10/05   YEN SIGN
10/06   BROKEN BAR
10/07   PARAGRAPH SIGN, SECTION SIGN
10/08   DIAERESIS
10/09   COPYRIGHT SIGN
10/10   FEMININE ORDINAL INDICATOR
10/11   LEFT ANGLE QUOTATION MARK
10/12   NOT SIGN
10/13   SOFT HYPHEN
10/14   REGISTERED TRADE MARK SIGN
10/15   MACRON
11/00   DEGREE SIGN
11/01   PLUS-MINUS SIGN
11/02   SUPERSCRIPT TWO
11/03   SUPERSCRIPT THREE
11/04   ACUTE ACCENT
11/05   SMALL GREEK LETTER MU, MICRO SIGN
11/06   PILCROW SIGN
11/07   MIDDLE DOT
11/08   CEDILLA
11/09   SUPERSCRIPT ONE
11/10   MASCULINE ORDINAL INDICATOR
11/11   RIGHT ANGLE QUOTATION MARK
11/12   VULGAR FRACTION ONE QUARTER
11/13   VULGAR FRACTION ONE HALF
11/14   VULGAR FRACTION THREE QUARTERS
11/15   INVERTED QUESTION MARK
12/00   CAPITAL LETTER A WITH GRAVE ACCENT
12/01   CAPITAL LETTER A WITH ACUTE ACCENT
12/02   CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
12/03   CAPITAL LETTER A WITH TILDE
12/04   CAPITAL LETTER A WITH DIAERESIS
12/05   CAPITAL LETTER A WITH RING ABOVE
12/06   CAPITAL DIPHTHONG A WITH E
12/07   CAPITAL LETTER C WITH CEDILLA
12/08   CAPITAL LETTER E WITH GRAVE ACCENT
12/09   CAPITAL LETTER E WITH ACUTE ACCENT
12/10   CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
12/11   CAPITAL LETTER E WITH DIAERESIS
12/12   CAPITAL LETTER I WITH GRAVE ACCENT
12/13   CAPITAL LETTER I WITH ACUTE ACCENT
12/14   CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
12/15   CAPITAL LETTER I WITH DIAERESIS
13/00   CAPITAL ICELANDIC LETTER ETH
13/01   CAPITAL LETTER N WITH TILDE
13/02   CAPITAL LETTER O WITH GRAVE ACCENT
13/03   CAPITAL LETTER O WITH ACUTE ACCENT
13/04   CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
13/05   CAPITAL LETTER O WITH TILDE
13/06   CAPITAL LETTER O WITH DIAERESIS
13/07   (This position shall not be used)
13/08   CAPITAL LETTER O WITH OBLIQUE STROKE
13/09   CAPITAL LETTER U WITH GRAVE ACCENT
13/10   CAPITAL LETTER U WITH ACUTE ACCENT
13/11   CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
13/12   CAPITAL LETTER U WITH DIAERESIS
13/13   CAPITAL LETTER Y WITH ACUTE ACCENT
13/14   CAPITAL ICELANDIC LETTER THORN
13/15   SMALL GERMAN LETTER SHARP s
14/00   SMALL LETTER a WITH GRAVE ACCENT
14/01   SMALL LETTER a WITH ACUTE ACCENT
14/02   SMALL LETTER a WITH CIRCUMFLEX ACCENT
14/03   SMALL LETTER a WITH TILDE
14/04   SMALL LETTER a WITH DIAERESIS
14/05   SMALL LETTER a WITH RING ABOVE
14/06   SMALL DIPHTHONG a WITH e
14/07   SMALL LETTER c WITH CEDILLA
14/08   SMALL LETTER e WITH GRAVE ACCENT
14/09   SMALL LETTER e WITH ACUTE ACCENT
14/10   SMALL LETTER e WITH CIRCUMFLEX ACCENT
14/11   SMALL LETTER e WITH DIAERESIS
14/12   SMALL LETTER i WITH GRAVE ACCENT
14/13   SMALL LETTER i WITH ACUTE ACCENT
14/14   SMALL LETTER i WITH CIRCUMFLEX ACCENT
14/15   SMALL LETTER i WITH DIAERESIS
15/00   SMALL ICELANDIC LETTER ETH
15/01   SMALL LETTER n WITH TILDE
15/02   SMALL LETTER o WITH GRAVE ACCENT
15/03   SMALL LETTER o WITH ACUTE ACCENT
15/04   SMALL LETTER o WITH CIRCUMFLEX ACCENT
15/05   SMALL LETTER o WITH TILDE
15/06   SMALL LETTER o WITH DIAERESIS
15/07   (This position shall not be used)
15/08   SMALL LETTER o WITH OBLIQUE STROKE
15/09   SMALL LETTER u WITH GRAVE ACCENT
15/10   SMALL LETTER u WITH ACUTE ACCENT
15/11   SMALL LETTER u WITH CIRCUMFLEX ACCENT
15/12   SMALL LETTER u WITH DIAERESIS
15/13   SMALL LETTER y WITH ACUTE ACCENT
15/14   SMALL ICELANDIC LETTER THORN
15/15   SMALL LETTER y WITH DIAERESIS

End of table

--------------------------------

Note from lasko@video.dec.com (Tim Lasko) about ISO-Latin/1:

ISO Latin-1, or more completely ISO Latin Alphabet No 1, is now an
international standard as of February 1987 (IS 8859, Part 1).
For those American USEnet'rs that care, the 8-bit ASCII standard,
which is essentially the same code, is going through the final 
administrative processes prior to publication.

The code table that was posted earlier by Mr. Sommarskog to the net is from
an earlier draft of the standard; the following changes have been made:

OLD DRAFT: 
 
13/07   (This position shall not be used)
15/07   (This position shall not be used)

FINAL STANDARD:

13/07    MULTIPLICATION SIGN
15/07    DIVISION SIGN

Those two characters were added mainly out of the fear that individual vendors
would use the positions for non-interchangeable and incompatible purposes,
thus defeating the idea of the standard.  The two symbols chosen were more
or less a compromise from a large list of eligible characters.
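
The two final assignments can be checked with a short Python sketch.  It
assumes that Python's built-in "latin-1" codec follows the final
standard; name_at is a name made up for this example.

```python
import unicodedata

def name_at(high: int, low: int) -> str:
    """Unicode name of the character the latin-1 codec puts at high/low."""
    char = bytes([high * 16 + low]).decode("latin-1")
    return unicodedata.name(char)

# The two positions that were unused in the draft.
for high, low in [(13, 7), (15, 7)]:
    print(f"{high:02d}/{low:02d}  {name_at(high, low)}")
```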

ISO Latin-1 (IS 8859/1) is actually one of an entire family of eight-bit
one-byte character sets, all having ASCII on the left hand side, and with
varying repertoires on the right hand side:
     
Pt 1.   Latin Alphabet No 1  (caters to Western Europe - now approved)
Pt 2.   Latin Alphabet No 2  (caters to Eastern Europe - now approved)    
Pt 3.   Latin Alphabet No 3  (caters to SE Europe + others - in draft ballot)
Pt 4.   Latin Alphabet No 4  (caters to Northern Europe - in draft ballot)
Pt 5.   Latin-Cyrillic alphabet  (right half all Cyrillic - processing
                                   currently suspended pending USSR input)
Pt 6.   Latin-Arabic alphabet    (right half all Arabic - now approved)
Pt 7.   Latin-Greek alphabet     (right half Greek + symbols - in draft ballot)
Pt 8.   Latin-Hebrew alphabet    (right half Hebrew + symbols - proposed)
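
To illustrate "ASCII on the left, varying repertoires on the right", a
sketch using Python's codecs named iso8859_1 through iso8859_8.  Those
codecs are assumed to reflect the parts as eventually published, which
may differ in detail from the drafts listed above.

```python
PARTS = [f"iso8859_{n}" for n in range(1, 9)]

def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each part; unassigned codes become U+FFFD."""
    return {codec: raw.decode(codec, errors="replace") for codec in PARTS}

# Left half (ASCII): identical under every part.
print(decode_all(b"abc"))

# Right half: the same byte, position 14/09, means different things.
for codec, text in decode_all(b"\xe9").items():
    print(codec, text)
```

This is why a document has to say which part it uses: the byte alone does
not identify the character.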
                                                               
I expect to update this list shortly, because next week I'm attending the
meeting of the ISO Working Group concerned with these standards.
(ISO TC97/SC2/WG3 for those that can decipher that.)