[comp.text.sgml] Umlaut vs. Diaresis

manfred@swi.psy.uva.nl (Manfred Aben) (10/02/90)

Hi there!

On behalf of a friend who does not have access to the NewsNet, I would like
to know if any of you have some ideas about the following (possibly also other
newsgroups):

Does SGML provide a way of specifying an 'Umlaut', as is used in German, in
contrast with a 'diaeresis', used in many languages, such as the Dutch language.
In printed text, i.e. typographically, these two are not distinguishable
anymore. Note that an umlaut historically was printed as two vertical stripes
(") on top of a vowel, whereas a diaeresis should be printed as two dots (..) on
top of a vowel.
These two things are not the same: the umlaut changes the sound of the vowel,
whereas the diaeresis is used to mark a separation in a word.

All replies are greatly appreciated (references, ideas?)
You can mail me directly (manfred@swi.psy.uva.nl) or reply in this group.

Thnx in advance!

=;-)

----------------------------------------------------
Manfred Aben
Dept. of Social Sciences Informatics
University of Amsterdam
Herengracht 196  1016 BS  AMSTERDAM (The Netherlands)
-----------------------------------------------------

killian@galvia.enet.dec.com (10/09/90)

To: manfred@swi.psy.uva.nl ()
Cc: 
Subject: Re: Umlaut vs. Diaresis (?)

> Does SGML provide a way of specifying an 'Umlaut', as is used in German, in
> contrast with a 'diaeresis', used in many languages, such as the Dutch
> language.

In general, yes!  There are two possible ways of doing this; which one to 
choose depends on whether you want to enter the accented character in text 
(i.e. content) or in markup.

In the more common case where you want to enter the accented character in 
text, SGML would advise the use of an SDATA entity (system specific data 
entity).  For example, &ouml; could be used to enter a small 'o' with an 
umlaut, and &odiar; could be used to enter a small 'o' with a diaeresis.

Of course these entities need to be declared in the scope of the Document 
Type Definition. This is normally done for a complete collection of such 
entities, and transportability is enhanced when user communities can agree 
on, and standardise, such entity sets.  'ouml' is present in the ISO Latin 
1 entity set published as an informative annex to the SGML standard, but its 
description includes 'o' diaeresis as well, so this particular entity set 
will not solve your problem.
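
As a rough sketch of what such declarations might look like in a private 
entity set: the name 'odiar', the narrowed meaning of 'ouml', and the 
bracketed SDATA strings are all assumptions made for illustration. What the 
strings actually produce depends entirely on the receiving system.

```sgml
<!-- Hypothetical private entity set; names and SDATA strings are assumptions -->
<!ENTITY ouml  SDATA "[ouml  ]"  -- small o, umlaut (German) -->
<!ENTITY odiar SDATA "[odiar ]"  -- small o, diaeresis (Dutch) -->
```

In the text one would then write, for instance, sch&ouml;n (German) and 
co&odiar;rdinatie (Dutch).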

In addition, an SGML system must be able to correctly interpret the 'o' 
diaeresis entity reference when the parser finds it in the text. For example, 
your SGML typesetting system must be able to translate the declared value of 
the 'odiar' entity to the correct glyph shape.

The other solution, which would also allow the use of the accented character 
in markup (e.g. tag names), is to use a character set that has a code position 
for the accented character (a different code position than the similar 'o' 
umlaut).

SGML is character set independent, in that the SGML declaration (before the 
Document Type Definition, but absent from most SGML documents) allows the 
identification or definition of the document character set.

Of course, the SGML parser must be able to accept the SGML declaration (not 
every parser does) and the SGML system must be able to accept (e.g. typeset) 
text in that character set. Defining your own character set is not always a 
smart thing to do.

I have also seen some unconventional solutions to your problem. One such 
solution involved defining a special element (tag) that was used to enter 
accented characters. For example: <special>odiar</special>. 

Again, this element would have to be defined in the scope of the Document 
Type Definition and the SGML system would have to be capable of translating 
the 'odiar' text into the required accented character.
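
A minimal sketch of how that might be declared and used. The content model 
shown here is an assumption; the element name and the 'odiar' text simply 
follow the example above.

```sgml
<!-- Hypothetical declaration; both start-tag and end-tag are required -->
<!ELEMENT special - - (#PCDATA) -- content names an accented character -->

<!-- In the document instance -->
co<special>odiar</special>rdinatie
```

Note that here the parser only sees ordinary character data; all knowledge 
of what 'odiar' means lives in the application that processes the element.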

My recommendation is to use the SDATA entity solution if the accented 
character is not required in markup.

Regards,
	Aidan

inc@tc.fluke.COM (Gary Benson) (10/11/90)

In article <4410@swi.swi.psy.uva.nl> manfred@swi.psy.uva.nl () writes:

>Does SGML provide a way of specifying an 'Umlaut', as is used in German, in
>contrast with a 'diaeresis', used in many languages, such as the Dutch language.
>In printed text, i.e. typographically, these two are not distinguishable
>anymore. Note that an umlaut historically was printed as two vertical stripes
>(") on top of a vowel, whereas a diaeresis should be printed as two dots (..) on
>top of a vowel.
>These two things are not the same: the umlaut changes the sound of the vowel,
>whereas the diaeresis is used to mark a separation in a word.

>All replies are greatly appreciated (references, ideas?)

Those two dots are not always either an umlaut **OR** a diaeresis  - - -  for
example, in Finnish, a letter "A" with two dots above is a separate letter
altogether...A(two-dots) comes after Z in the alphabet! The same holds true
in other languages.... French for example uses "grave" and "acute" "accents"
to make different letters. (Oh! look at the German "ss" that looks like an
English upper-case letter "B")

Instead of trying to get the world to see things the way some local language
sees it, don't you think we'd gain more by looking at symbols attached
to letters as part of the letter, rather than trying to define what they
mean in the language of origin? Really.

In Finnish, I cannot write "maki" on this vt100 terminal.

"maki" is just a hill. In Finnish, it needs two dots above the "A" to make
it a real word. If you read it without the two dots, it means nothing.

So: in some languages at least, the two dots are not an intensive -- they
make the letter into a different letter.

-- 
Gary Benson    -=[ S M I L E R ]=-   -_-_-_-inc@fluke.com_-_-_-_-_-_-_-_-_-_-

He who shits on the road will meet flies on his return.  -South African Proverb