[net.text] troff special chars - naming them

aeb@mcvax.UUCP (Andries Brouwer) (07/19/85)

In article <1065@diku.UUCP> keld@diku.UUCP (Keld J|rn Simonsen) writes:
>
>I would like to discuss special char naming in troff (titroff/ditroff)
> ...
>The rules: when a letter is composed of an accent and a latin letter
>you specify the accent first. Accents used: ' ` : , ~ ^ maybe more.
>
>Special letters not accented: ae AE oe OE /o /O oa oA (Aangstroem)
>ij IJ ss. The meaning should be obvious.
>
I think we should include the graphics used in Eastern European countries
[or, more generally, all characters used in countries where the Latin
 alphabet is used]
(so that a proposed norm will still be usable ten years from now,
 when Poland will also be using UNIX - moreover, people here have use for
 these symbols every now and then).
As a result, oa is not so special any more; in Romanian one has ou (a u with
small circle on top), so o should be classed among the accents.
Similarly v (havcek) is an accent (popular in Czech).
And / is an accent, not only for the Scandinavian /o but also for the Polish /l.
Icelandic has -d (and people doing quantum mechanics love -h),
Polish has .z, Turkish dotless i (how do you represent that?);
Hungarian knows '' (two acute accents, as in the name of the famous
mathematician P'al Erd"os - I use "o instead of :o - this is distinct from the
umlaut).
In Romanian one also has an accent that I cannot represent conveniently:
on top of a vowel one may have a circular arc (like v but rounded),
perhaps a u prefix would be sufficiently suggestive in your scheme.

keld@diku.UUCP (Keld J|rn Simonsen) (07/19/85)

I would like to discuss special char naming in troff (titroff/ditroff)
This area is very messy, every typesetter owner seems to invent
his own names, and we thereby lose portability of documents.
This is especially true in Europe, where we have all these
strange languages, almost one per country.

The rules: when a letter is composed of an accent and a latin letter
you specify the accent first. Accents used: ' ` : , ~ ^ maybe more.

Special letters not accented: ae AE oe OE /o /O oa oA (Aangstroem)
ij IJ ss. The meaning should be obvious.

Design criteria: I have chosen a graphical mnemonic scheme
for reasons of international use. If you choose an abbreviation
which is language bound, it is not likely to be recognised
in countries with other languages. Also the same graphical
char may have different meanings in various countries, e.g. the
oA: in Danish it is known as the Danish letter AA,
in other countries it is just known as Aangstroem.
It seems better to use the graphical description:
an 'A' with a circle above it.

aeb@mcvax.UUCP (Andries Brouwer) (07/21/85)

In article <1070@diku.UUCP> keld@diku.UUCP (Keld J|rn Simonsen) writes:
>Fine with me, all these accents that Andries mentioned. And I do not
>much mind saying that /o is accented, although that was not what I
>learned in school. So we move / and o to be accents. So it goes:
>
>When a letter is composed of an accent and a latin letter
>you specify the accent first. Accents used: ' ` : , ~ ^ / o - . v u "
>and maybe more.
>
>Special letters not accented: ae AE oe OE ij IJ ss.
>The dotless i might be .i - which read normally is meaningless -
>there is a dot there already.
>
>The problem with the u and v accents is that they may lead to
>names already defined in the standard troff. I have not investigated
>this; maybe Andries knows more about the possibilities.

I know of \(ul for underline, \(ua for up-arrow and \(or for |.
(These are rather standard; of course many sites have invented
symbol names of their own and assigned random 2-char names.)
Unfortunately, I need the \(ua for my Romanian a with arc on top,
but one might use \(Ua for lower case and \(UA for upper case
(and similarly \(Oa and \(OA for aa and AA, etc.).
Using two-letter symbol names starting with a capital also greatly
reduces the risk of conflict with already defined symbols.
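
The point about capitalised names can be checked mechanically. The
following is a minimal sketch in Python (not troff), assuming that only
the three standard names cited above (ul, ua, or) are taken; a real
installation defines many more, and the helper name `conflicts` is
hypothetical.

```python
# Names this post cites as already standard in troff.
# Assumption: the real list is much longer; these three are just the
# ones mentioned in the thread.
KNOWN_TROFF_NAMES = {"ul", "ua", "or"}

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def conflicts(accents, letters=LETTERS, taken=KNOWN_TROFF_NAMES):
    """Return the accent+letter names that collide with names in `taken`."""
    return sorted(a + l for a in accents for l in letters if a + l in taken)


# Lower-case accent prefixes u, v, o collide with existing names ...
print(conflicts("uvo"))
# ... while the capitalised prefixes U, V, O do not.
print(conflicts("UVO"))
```

With only these three reserved names, the lower-case prefixes produce
three collisions and the capitalised ones produce none.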

keld@diku.UUCP (Keld J|rn Simonsen) (07/21/85)

Fine with me, all these accents that Andries mentioned. And I do not
much mind saying that /o is accented, although that was not what I
learned in school. So we move / and o to be accents. So it goes:

When a letter is composed of an accent and a latin letter
you specify the accent first. Accents used: ' ` : , ~ ^ / o - . v u "
and maybe more.

Special letters not accented: ae AE oe OE ij IJ ss.
The dotless i might be .i - which read normally is meaningless -
there is a dot there already.

The problem with the u and v accents is that they may lead to
names already defined in the standard troff. I have not investigated
this; maybe Andries knows more about the possibilities.

keld@diku.UUCP (Keld J|rn Simonsen) (07/22/85)

Indeed, ua is standard troff for upwards arrow. I would be happy with
reserving char names with the capital letters O U V to be used primarily
for "accented" letters, if we can agree to it. It is better to have a
standard - and a fairly complete, simple and coherent one at that -
than to have whatever names first come to mind. Oa is a bit
far away from the natural (to Scandinavians) aa, but I can handle
all that stuff in my .tr-based .la (language shift or output character
set shift) macro.

.la is a 20-line troff macro handling shifts
between ASCII, British BS, German DIN, French, Danish, Swedish,
Norwegian, Finnish, Spanish and ISO international reference version,
which converts 7-bit input code to output in the relevant ISO National
Character Set Version of ISO 646-1983.
Including other national ISO character set versions is trivial.
The character set shift is done in troff directly via the .tr directive.
No time-consuming, non-standard special preprocessor is needed.

People can then just write their national characters as they are used
to, e.g. {, |, } (for what we at the moment call ae, /o, Oa),
shift to ASCII output when needed, and switch back again ...
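
The kind of one-for-one character shift described here can be sketched
as follows. This is an illustration in Python, not the .la macro (which
is not shown in the post); the table covers only the three characters
mentioned above, and the names `DANISH_TO_NAMES` and `shift_to_names`
are hypothetical.

```python
# Assumption: the 7-bit input uses {, | and } for the three Danish
# letters, and the output uses the provisional special-character names
# from this thread (ae, /o, Oa), written in troff's \(xx form.
DANISH_TO_NAMES = {"{": r"\(ae", "|": r"\(/o", "}": r"\(Oa"}


def shift_to_names(text, table=DANISH_TO_NAMES):
    """Replace each national character by its two-letter special name."""
    return text.translate(str.maketrans(table))


# Characters outside the table pass through unchanged.
print(shift_to_names("r|d gr|d med fl|de"))
```

Switching to another national version would amount to passing a
different table, which is roughly what a .tr-based shift does inside
troff itself.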

andersa@kuling.UUCP (Anders Andersson) (07/22/85)

In article <763@mcvax.UUCP> aeb@mcvax.UUCP (Andries Brouwer) writes:
>In article <1065@diku.UUCP> keld@diku.UUCP (Keld J|rn Simonsen) writes:
>>
>>I would like to discuss special char naming in troff (titroff/ditroff)
>> ...
>>The rules: when a letter is composed of an accent and a latin letter
>>you specify the accent first. Accents used: ' ` : , ~ ^ maybe more.
>>
>>Special letters not accented: ae AE oe OE /o /O oa oA (Aangstroem)
>>ij IJ ss. The meaning should be obvious.
>>
>I think we should include the graphics used in Eastern European countries
>[or, more generally, all characters used in countries where the Latin
> alphabet is used]

This suggestion seems to be a re-invention of what already exists in TeX,
so why not just take it as it is? TeX defines all those national letters
in almost the same way, and I think it's a bad idea to have two such
similar standards, causing lots of problems for those poor guys who have
to learn both.

aeb@mcvax.UUCP (Andries Brouwer) (07/26/85)

Last time I just mentioned a few accents that occurred to me while
writing - let me now give a more detailed overview of what accents
exist.

1. Accents on top

- Acute accent (') occurs on top of almost anything; many languages
  have 'a 'e 'i 'o 'u ; Icelandic also 'y ; Slovak also 'y 'r 'l ;
  Polish also 'c 'n 's 'z ; Latvian has a character that is sometimes
  printed as 'g (see below); etc.
  Note that the ' on 'a does not have the same slope as the ' on 'i .

- Grave accent (`) occurs in many languages in `a `e `i `o `u ;
  Slovene `r

- Circumflex (^) occurs in many languages in ^a ^e ^i ^o ^u ;
  Esperanto has ^c ^g ^h ^j ^s ; accented Latvian has ^l .

- Trema/Diaeresis/Umlaut (::/") occurs as umlaut in many languages in
  "a "o "u (e.g. German, Slovak, Finnish, Swedish, Turkish, Hungarian);
  as trema in ::a ::e ::i ::o ::u .

- Hacek (h\'a\vcek) (v) occurs in many Slavic languages; Czech has
  ve vc vn vs vr vz ; Slovak also vD ; Esperanto vu .
  In transcriptions one meets other letters with hacek, e.g. Armenian vj .
- When the letter that should get the hacek is tall, then it gets a
  comma at the upper right instead: Czech has ,d ,t ; Slovak also ,l .

- Dot above (:) occurs in various places; the most obvious ones are
  :z in Polish and :e in Lithuanian, but I found it also e.g. as :n
  in the African language Bamoum.

- Macron (overline) (-) occurs as -a -e -i in Latvian, as -u in Lithuanian
  and is otherwise generally used to denote the length of vowels.

- Corona (circle above) (o) is found in Scandinavian oa and Czech ou .

- Tilde (~) is found in Spanish ~n , Portuguese ~a ~o and otherwise e.g.
  in accented Baltic languages: ~a ~e ~i ~o ~y ~m ~n ~l ~r ~.e .

- Breve (half circle above) (U) is found in Romanian Ua , Turkish Ug ,
  Vietnamese Ua and is otherwise generally used to denote short vowels.

- Double acute ('') is found in Hungarian ''o and ''u .

- High tone mark (question mark without dot) (?) is found in Vietnamese
  ?a ?o ?u .

- In Latvian the palatalized sounds have a comma below, as we shall see,
  but in ,g there is no room for the , to go below, and one finds it on
  top instead. I have met three variations: 'g (acute accent), ,g (high
  centered comma) and I,g (high centered inverted comma).
  Sometimes the high centered inverted comma is met in other places; I
  have seen I,k and I,t in transliterated Armenian and I,p in Sorbian.

- In old Croatian texts one finds the double grave accent (``) as in
  ``a ``e ``i ``r .


  
2. Accents below

- Cedille (,) or left hook occurs in French ,c ; in Turkish ,s ;
  in Romanian ,s ,t ; in Latvian ,k ,l ,n ,r (and ,K ,L ,N ,R ,G - for ,g
  see above).
  These hooks do not always resemble a comma.

- Rude (L) or right hook occurs in Polish La Le ; in Thai and old Norse Lo ;
  in Lithuanian La Li Lu ; in old Latvian Le Lk .
  These hooks start to the right of the center: sometimes almost at the
  center, sometimes at the lower right-hand corner.

- Dot below (.) occurs in Vietnamese .a .e .o ; in transliterations from
  Arabic or Sanskrit one meets .d .t .s .r .h etc.

- Corona below (0) occurs in transliterations, often to indicate that a
  sonorant has syllabic value: 0m 0n 0l 0r 0s .

- Breve below (u) occurs in transliteration of Sanskrit and Hittite uh .

- Double dot below (..) seems to occur in transliterated Urdu ..t .

- Vertical bar below (|) seems to occur in Yoruba |o .

- Circumflex below (A) seems to occur in Bamileke and Venda Ae .


3. Accents on more than one letter simultaneously

- An arc on top may join two letters, like in the transliteration of
  the Russian "reflected R" as IU{ia} .

- In Tagalog occurs a tilde on the ng digraph: ~{ng} .

- Underline (_) is often used to indicate that two letters transliterate
  one sound, e.g. in various Indian languages _{kh} .

- Similarly the double underline (=) is sometimes used when the combination
  of two letters represents two distinct sounds, e.g. Urdu ={gh} .
  (See also the ligature above.)


Note that I do not propose a naming scheme for accented symbols here - the
chosen denotations are purely ad hoc. Simple schemes as discussed earlier
almost always work, but fail when one letter carries several diacritical
marks. In Vietnamese one finds letters with acute and circumflex
side by side (so that it looks like a rotated 'less than or equals' sign):
{'^}a {'^}e {'^}o and towers like '^o ^a. ?Ua ~^e (read from top to bottom).
In Lithuanian one meets ~.e ~u, {.'}e '-u etc.
Clearly, when symbols can have three or more accents in various mutual
positions then some nontrivial grammar is needed to describe the situation.
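
To make the difficulty concrete, here is a minimal sketch in Python of
the simple "accents first, letter last" reading, with marks read top to
bottom. The table holds only a handful of the top accents listed above,
and the name `parse` and the greedy longest-match rule are assumptions
made for illustration, not a proposed scheme.

```python
# A few of the ad hoc denotations from section 1 (accents on top).
# Assumption: two-character marks are tried before one-character ones.
ACCENTS = {
    "''": "double acute",
    "::": "trema",
    "``": "double grave",
    "'": "acute",
    "`": "grave",
    "^": "circumflex",
    ":": "dot above",
    "~": "tilde",
    "-": "macron",
    "U": "breve",
    "?": "high tone mark",
}

# Longest marks first, so that '' is not read as two separate acutes.
TOKENS = sorted(ACCENTS, key=len, reverse=True)


def parse(name):
    """Split a name into (list of accent descriptions, base letter)."""
    if not name:
        raise ValueError("empty name")
    prefix, base = name[:-1], name[-1]
    marks, i = [], 0
    while i < len(prefix):
        for tok in TOKENS:
            if prefix.startswith(tok, i):
                marks.append(ACCENTS[tok])
                i += len(tok)
                break
        else:
            raise ValueError("unknown accent in %r" % name)
    return marks, base


print(parse("'^o"))   # a tower: acute above circumflex
print(parse("?Ua"))   # high tone mark above breve
print(parse("''o"))   # read as one double acute
```

A flat reading like this records only the order of the marks. It cannot
express side-by-side placement such as {'^}a, and it cannot tell a
double acute from two stacked acutes, which is where a richer grammar
would be needed.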




4. Special symbols

Various ligatures are conventionally treated as a single symbol.
One has Dutch ij , German ss (or sz), French oe and
Scandinavian (and Latin) ae .

Turkish has dotless i (.i).

Icelandic has the thorn (bp) or (th).

Some symbols with a crossbar are
Polish /l and /L ; Scandinavian /o and /O ; Vietnamese and Yugoslavian
and Icelandic -d and -D ; Icelandic +d (eth).



Well, this is what I have found so far. Where I said
"seems to occur", the information is quoted from an old draft version
of ISO standard ISO 5426 (dated 1975-07-10).
I would be thankful if people mailed me their additions and corrections.

irenas@tekig4.UUCP (Irena Sifrar) (08/01/85)

Andries Brouwer writes:
>1. Accents on top
>
>- Grave accent (`) occurs in many languages in `a `e `i `o `u ;
>  Slovene `r
>
I have never seen `r in Slovene.  There are no accents on Slovene letters
except when you want to denote the stress (mostly only dictionary
use).  In a way "r" can be one of the stressed letters, as in "mrtev",
but the word is actually pronounced [mer'tev], so the accent actually
falls on the implicit e (sounds like "a" in English, not like "ei").
I'd really like to see some examples of `r, if there are any.

Actually, Slovene does have three occurrences of accent that just have
to be there: hacek on top of c, s, z.  Even if c, s, or z are capitalized,
the hacek remains itself. (see below)

>- When the letter that should get the hacek is tall, then it gets a
>  comma at the upper right instead: Czech has ,d ,t ; Slovak also ,l .
>

>4. Special symbols
>
>Some symbols with a crossbar are
>Polish /l and /L ; Scandinavian /o and /O ; Vietnamese and Yugoslavian
>and Icelandic -d and -D ; Icelandic +d (eth).
>
There is no such language as Yugoslavian.  There is Macedonian,
Serbo-Croatian (slight differences between the two), and Slovene. 
Serbo-Croatian is the most common, even the Macedonians and the Slovenes
can speak in it.  The language of the government is usually Serbo-Croatian,
though at the assemblies people can talk in any of the three
above-mentioned languages.
			Irena Sifrar