[comp.text.tex] Ascii German -> Diacritics

dorai@tone.rice.edu (Dorai Sitaram) (10/20/90)

If you're interested in a teensy program that converts "Ascii German"
into "diacritical German", you might want to read this.  As I'm pretty
certain the demand will be rather small, I won't post it.  Send me
email if you want the program.

Description:

\begin{nomenclature}

Ascii German (AG): the 26-letter German orthography commonly used on
"regular" keyboards, viz., ae, oe, ue, ss for the umlauted vowels and
eszett.

Diacritical German (DG): the real stuff, viz., what \"a, \"o, \"u, \ss
produce in TeX's output.

\end{nomenclature}

Why?

A lot of people comfortably read and write AG on keyboards instead of
using TeX source (\"a) or even german.sty kind of source ("a).
However, when such text eventually gets converted to something which
allows visible umlauts and eszett, we'd like to convert our AG into
the proper diacritical form.

This is slightly tricky since, although encoding DG into AG is
mechanical, the reverse (decoding) isn't: the encoding is not 1-1, so
an AG digraph may stand either for a special character or for itself.
One certainly can't convert all occurrences of ae, oe, ue, ss into the
corresponding special characters.  E.g., Dauer != Da\"ur; wissen !=
wi{\ss}en; etc.
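
To make the problem concrete, here is a minimal sketch of the naive
decoding that does NOT work.  It is written in Python purely for
illustration (the program discussed in this thread is a lex program),
and the names NAIVE and naive_decode are made up for this example.

```python
import re

# Unconditional digraph table (illustration only, lowercase forms only).
NAIVE = {"ae": r'\"a', "oe": r'\"o', "ue": r'\"u', "ss": r"{\ss}"}

def naive_decode(word):
    """Replace every ae/oe/ue/ss in one left-to-right pass, ignoring context."""
    return re.sub(r"ae|oe|ue|ss", lambda m: NAIVE[m.group()], word)

for w in ["Maedchen", "Dauer", "wissen"]:
    print(w, "->", naive_decode(w))
# Maedchen -> M\"adchen    (right)
# Dauer -> Da\"ur          (wrong)
# wissen -> wi{\ss}en      (wrong)
```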

I brewed a very little lex program "diac" which converts AG into DG
using context information to figure out which ae/oe/ue/ss get
converted.  The output is TeX source style.
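
A rough sketch of what "using context information" could look like
follows.  The rules here are assumptions invented for this example,
not the patterns in "diac", and they are written in Python rather than
lex.  They assume the traditional orthography (eszett at the end of a
word), handle lowercase digraphs only, and are far from complete.

```python
import re

UMLAUT = {"ae": r'\"a', "oe": r'\"o', "ue": r'\"u'}

def decode(text):
    """Decode AG digraphs using a little left and right context."""
    def repl(m):
        s, i, j = m.string, m.start(), m.end()
        digraph = m.group()
        before = s[i - 1].lower() if i > 0 else ""
        if digraph == "ue":
            # Assumed rule: u+e after a, e or q is not an umlaut
            # (Dauer, neue, Quelle).
            return digraph if before in ("a", "e", "q") else UMLAUT[digraph]
        if digraph in UMLAUT:
            # ae and oe are converted unconditionally in this sketch.
            return UMLAUT[digraph]
        # Remaining case is "ss".  Assumed rule: eszett at the end of a
        # word or right after ie/ei; otherwise leave the ss alone.
        word_final = j == len(s) or not s[j].isalpha()
        if word_final or s[max(0, i - 2):i].lower() in ("ie", "ei"):
            return r"{\ss}"
        return digraph
    return re.sub(r"ae|oe|ue|ss", repl, text)

print(decode("Dauer Muenchen heissen wissen Fluss"))
```

Even these few rules get Dauer and wissen right, but they leave
Strasse untouched and cannot help with Masse, which is why a real
pattern set has to be much larger.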

There is also an easy way to invoke "diac" as a preprocessor for
LaTeX, so you can write your .tex files in AG style but get 
.dvi output with flawless diacritics.

Except for Masse/Ma{\ss}e. :-] 

Actually, I have very certainly not exhausted all the patterns that
are recognizable, though the present program seems fine on some test
data I have.  Luckily, it's easy to add new patterns (and to know what
patterns to add) in the event of future bloopers.

This program isn't necessarily restricted to German.  It should be
easily modifiable to any other language that uses diacritics and has a
popular or idiosyncratic Ascii encoding.  (E.g., an Indian language in
Ascii w.r.t. any standard representation in Roman diacritics.)
Moreover, as you get the hang of the patterns used, you can devise
your own patterns to get better decodings.

--d

dorai@egeria.rice.edu (Dorai Sitaram) (10/22/90)

>If you're interested in a teensy program that converts "Ascii German"
>into "diacritical German", you might want to read this.  [...]

It ("most current version") is now available as the shar file
public/ascii2german.sh via anonymous ftp from titan.rice.edu.

Email me if (when?) you find bugs.  (If you think nobody should be
writing "ascii" German in the first place, I sympathize, but as the
great Huey Lewis once intoned, there is no perfect world anyway.)

--d

icking@gmdzi.gmd.de (Werner Icking) (10/30/90)

dorai@tone.rice.edu (Dorai Sitaram) writes:
[...]
>I brewed a very little lex program "diac" which converts AG into DG
>using context information to figure out which ae/oe/ue/ss get
>converted.  The output is TeX source style.
[...]
>Except for Masse/Ma{\ss}e. :-] 

And what about names like Mueller, which have to be written with ue if
the "owner" insists? There are other problems with names, because e and
sometimes i are used for making the preceding vowel longer: Buer, Buir,
Roisdorf, Troisdorf, Naefe - a colleague of mine. Other problems result from
the possibility of combining two or more words freely: Portoerstattung,
Kontoerfassung, etc.
-- 
Werner Icking          icking@gmdzi.gmd.de          (+49 2241) 14-2443
Gesellschaft fuer Mathematik und Datenverarbeitung mbH (GMD)
Schloss Birlinghoven, P.O.Box 1240, D-5205 Sankt Augustin 1, FRGermany

dorai@egeria.rice.edu (Dorai Sitaram) (11/01/90)

In article <3513@gmdzi.gmd.de> icking@gmdzi.gmd.de (Werner Icking) writes:
>dorai@tone.rice.edu (Dorai Sitaram) writes:
>[...]
>>I brewed a very little lex program "diac" which converts AG into DG
>>using context information to figure out which ae/oe/ue/ss get
>>converted.  The output is TeX source style.
>[...]
>>Except for Masse/Ma{\ss}e. :-] 
>
>And what about names like Mueller, which have to be written with ue if
>the "owner" insists? There are other problems with names, because e and
>sometimes i are used for making the preceding vowel longer: Buer, Buir,
>Roisdorf, Troisdorf, Naefe - a colleague of mine. Other problems result from
>the possibility of combining two or more words freely: Portoerstattung,
>Kontoerfassung, etc.

Das Programm ist nur so gut wie die Muster, die es besitzt.  Also, man
kann sehr leicht mit Woertern wie {Buer, Naefe, ..., Kontoerfassung}
umgehen.  Gustaf Neumann hat einen sehr umfangreichen Wortschatz fuer
`diac' entwickelt, und ich moechte nicht vieles darueber sagen, weil
Gustaf selbst bald darueber schreiben wird, wenn er mal ganz zufrieden
mit seinem Programm geworden ist.

Das Hauptproblem besteht nur darin, wenn ein Ascii-Wort zwei moegliche
`Uebersetzungen' hat, beide sinnvoll.  Z.B. Masse/Ma{\ss}e sowie
M\"uller/Mueller.  Ich gebe zu, dass ich keine richtige Loesung fuer
diese Ambiguitaet weiss.  Man kann auf jeden Fall `Mu{}eller' (sehr
haesslich :-[) schreiben, wenn die Buchstabierung mit `u-e' (nicht
`\"u') gewuenscht wird.  Man kann auch ein weiteres `Fenster' fuer die
Ascii-Zeichenketten benutzen (also, man beobachtet ganze Phrasen statt
nur Woerter).  Mit dieser Methode kann man sofort (ok, ok, es gibt
noch einige Schwierigkeiten) erkennen, dass `in hohem Masse' nur als
`in hohem Ma{\ss}e' zu uebersetzen ist.

Der Herr Mueller aber bleibt ein unsympathischer Mensch.

--d

(for comp.text.tex)

The program is only as good as the patterns it has.  Thus, one can
easily include the appropriate rules for {Buer, Naefe, ...,
Kontoerfassung}.  Gustaf Neumann has developed a very comprehensive
set of patterns, and I don't want to say too much about it, since
Gustaf will talk about it himself before long, as soon as he's
satisfied with his creation.
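
As an illustration of how such extra rules can sit in front of the
general ones, here is a small sketch.  It is Python, not lex, and it is
not Gustaf's pattern set: KEEP and decode_umlauts are hypothetical
names, the word list is just the examples from this thread, and only
lowercase ae/oe/ue are handled.

```python
import re

# Words whose ae/oe/ue must be left alone (examples from this thread).
KEEP = {"Buer", "Naefe", "Portoerstattung", "Kontoerfassung"}

def decode_umlauts(text):
    """Convert ae/oe/ue to TeX umlauts, except inside the listed words."""
    def fix_word(m):
        word = m.group()
        if word in KEEP:
            return word  # the exception wins over the general rule
        # General rule: the first letter of the digraph gets the umlaut.
        return re.sub(r"ae|oe|ue", lambda d: '\\"' + d.group()[0], word)
    return re.sub(r"[A-Za-z]+", fix_word, text)

print(decode_umlauts("Kontoerfassung fuer Naefe"))
```

In lex the same effect usually comes for free: a pattern spelling out
the whole word is longer than the two-letter digraph pattern, and lex
prefers the longest match.  Note that this sketch matches exact word
forms only, so inflected forms would need their own entries.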

The chief problem, though, is when an Ascii word has two possible
transliterations, both of them meaningful.  E.g., Masse/Ma{\ss}e and
M\"uller/Mueller.  I cheerfully agree I haven't found a neat solution
for this ambiguity.  Of course, one could get around this by writing
`Mu{}eller' (eek!) when one wants the spelling with `u-e' (not `\"u').
One could also use a wider `window' when processing the Ascii text
(i.e., one observes phrases rather than just words).  With this
technique, one can immediately recognize (ok, ok, there are still problems) that
`in hohem Masse' is only to be transliterated as `in hohem Ma{\ss}e'.
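
Both workarounds can be sketched in a few lines.  Again this is Python
for illustration rather than the lex source, PHRASES and decode are
hypothetical names, and the single phrase entry is the one example
given above.

```python
import re

# Phrase-level patterns, tried before any word-level rule.
PHRASES = {"in hohem Masse": r"in hohem Ma{\ss}e"}

def decode(text):
    """Apply phrase patterns first, then the plain umlaut rule."""
    for ascii_phrase, tex_phrase in PHRASES.items():
        text = text.replace(ascii_phrase, tex_phrase)
    # Word-level rule.  "u{}e" does not contain the digraph "ue", so the
    # empty-group escape survives without any special handling.
    return re.sub(r"ae|oe|ue", lambda m: '\\"' + m.group()[0], text)

print(decode("in hohem Masse"))      # in hohem Ma{\ss}e
print(decode("die Masse"))           # die Masse
print(decode("Mu{}eller, Mueller"))  # Mu{}eller, M\"uller
```

The empty group {} typesets as nothing in TeX, so `Mu{}eller' still
comes out as Mueller on paper.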

Mr Mueller, however, still remains a spoilsport.

--d