dorai@tone.rice.edu (Dorai Sitaram) (10/20/90)
If you're interested in a teensy program that converts "Ascii German" into "diacritical German", you might want to read this. As I'm pretty certain the demand will be rather small, I won't post it. Send me email if you want the program.

Description:

\begin{nomenclature}
Ascii German (AG): the 26-letter German (orthography) commonly used on "regular" keyboards, viz., ae, oe, ue, ss for the umlauted vowels and eszett.

Diacritical German (DG): the real stuff, viz., what \"a, \"o, \"u, \ss produce in TeX's output.
\end{nomenclature}

Why? A lot of people comfortably read and write AG on keyboards instead of using TeX source (\"a) or even german.sty kind of source ("a). However, when such text eventually gets converted to something which allows visible umlauts and eszett, we'd like to convert our AG into the proper diacritical form.

This is slightly tricky since, although encoding DG as AG is unambiguous, the reverse (decoding) isn't. One certainly can't convert all occurrences of ae, oe, ue, ss into the corresponding special characters. E.g., Dauer != Da\"ur; wissen != wi{\ss}en; etc.

I brewed a very little lex program "diac" which converts AG into DG using context information to figure out which ae/oe/ue/ss get converted. The output is TeX source style. There is also an easy way to invoke "diac" as a preprocessor for LaTeX, so you can write your .tex files in AG style but get .dvi output with flawless diacritics. Except for Masse/Ma{\ss}e. :-]

Actually, I have most certainly not exhausted all the patterns that are recognizable, though the present program seems fine on some test data I have. Luckily, it's easy to add new patterns (and to know what patterns to add) in the event of future bloopers.

This program isn't necessarily restricted to German. It should be easily adaptable to any other language that uses diacritics and has a popular or idiosyncratic Ascii encoding. (E.g., an Indian language in Ascii w.r.t. any standard representation in Roman diacritics.) Moreover, as you get the hang of the patterns used, you can devise your own patterns to get better decodings.

--d
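
For readers who want a feel for what "using context information" can mean here, the following is a minimal sketch in Python rather than lex. The rules, the names `UMLAUT`, `VOWEL_PAIR`, `ESZETT` and `to_diacritical`, and the choice of contexts are all assumptions made for illustration; they are not the patterns in "diac", and the eszett rule follows the pre-1996 spelling that was current when this was posted.

```python
import re

# Illustrative rules only: assumptions for this sketch, not diac's patterns.
UMLAUT = {"ae": '\\"a', "oe": '\\"o', "ue": '\\"u',
          "Ae": '\\"A', "Oe": '\\"O', "Ue": '\\"U'}

# "ue" after a, e, or q is taken to be plain u + e (Dauer, neue, Quelle);
# every other ae/oe/ue is taken to be an umlaut.
VOWEL_PAIR = re.compile(r"(?<![aeqAEQ])[uU]e|[aAoO]e")

# "ss" is taken to be an eszett after a diphthong (heissen), before t
# (musste), or at the end of a word (dass). Elsewhere it is left alone,
# so wissen stays correct and the ambiguous Masse is never touched.
ESZETT = re.compile(r"(?<=ei|au|eu|ie)ss|ss(?=t|\b)")


def to_diacritical(text):
    """Convert Ascii German to TeX-style diacritics using the rules above."""
    text = ESZETT.sub(lambda m: "{\\ss}", text)
    return VOWEL_PAIR.sub(lambda m: UMLAUT[m.group(0)], text)


print(to_diacritical("Die Dauer fuer Mueller"))  # Die Dauer f\"ur M\"uller
```

The guards are deliberately crude. Words such as Strasse or Fuesse slip through untouched, which is the kind of blooper that calls for adding another pattern.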
dorai@egeria.rice.edu (Dorai Sitaram) (10/22/90)
>If you're interested in a teensy program that converts "Ascii German"
>into "diacritical German", you might want to read this.
[...]

It ("most current version") is now available as the shar file public/ascii2german.sh via anonymous ftp from titan.rice.edu. Email me if (when?) you find bugs.

(If you think nobody should be writing "ascii" German in the first place, I sympathize, but as the great Huey Lewis once intoned, there is no perfect world anyway.)

--d
icking@gmdzi.gmd.de (Werner Icking) (10/30/90)
dorai@tone.rice.edu (Dorai Sitaram) writes:
[...]
>I brewed a very little lex program "diac" which converts AG into DG
>using context information to figure out which ae/oe/ue/ss get
>converted. The output is TeX source style.
[...]
>Except for Masse/Ma{\ss}e. :-]

And what about names like Mueller, which have to be written with ue if the "owner" insists? There are other problems with names, because e and sometimes i are used to lengthen the preceding vowel: Buer, Buir, Roisdorf, Troisdorf, Naefe (a colleague of mine). Other problems result from the possibility of combining two or more words freely: Portoerstattung, Kontoerfassung, etc.

--
Werner Icking          icking@gmdzi.gmd.de          (+49 2241) 14-2443
Gesellschaft fuer Mathematik und Datenverarbeitung mbH (GMD)
Schloss Birlinghoven, P.O.Box 1240, D-5205 Sankt Augustin 1, FRGermany
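
The two failure cases raised here, names with a lengthening e and compounds whose parts meet at o + e, can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the word lists `KEEP_AS_IS` and `FIRST_PARTS` are hypothetical, and `naive` is a stand-in for a context-free converter, not for what "diac" does.

```python
import re

# Hypothetical word lists, chosen from the examples in the post.
KEEP_AS_IS = {"Buer", "Naefe"}      # names where e only lengthens the vowel
FIRST_PARTS = ("Porto", "Konto")    # first halves of freely formed compounds


def naive(word):
    """Turn every ae/oe/ue into a TeX umlaut, with no context at all."""
    return re.sub(r"[aou]e", lambda m: '\\"' + m.group(0)[0], word)


def convert_word(word):
    """Apply the naive rule, but respect the exception lists above."""
    if word in KEEP_AS_IS:
        return word
    for part in FIRST_PARTS:
        if word.startswith(part):
            # Convert each half separately so no umlaut spans the seam.
            return naive(part) + naive(word[len(part):])
    return naive(word)


print(naive("Portoerstattung"))         # Port\"orstattung (wrong)
print(convert_word("Portoerstattung"))  # Portoerstattung
```

A name such as Mueller cannot be settled by a list like this, because both spellings exist and only the owner knows which one applies.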
dorai@egeria.rice.edu (Dorai Sitaram) (11/01/90)
In article <3513@gmdzi.gmd.de> icking@gmdzi.gmd.de (Werner Icking) writes:
>dorai@tone.rice.edu (Dorai Sitaram) writes:
>[...]
>>I brewed a very little lex program "diac" which converts AG into DG
>>using context information to figure out which ae/oe/ue/ss get
>>converted. The output is TeX source style.
>[...]
>>Except for Masse/Ma{\ss}e. :-]
>
>And what about names like Mueller, which have to be written with ue if
>the "owner" insists? There are other problems with names, because e and
>sometimes i are used to lengthen the preceding vowel: Buer, Buir,
>Roisdorf, Troisdorf, Naefe (a colleague of mine). Other problems result from
>the possibility of combining two or more words freely: Portoerstattung,
>Kontoerfassung, etc.
>--
>Werner Icking          icking@gmdzi.gmd.de          (+49 2241) 14-2443
>Gesellschaft fuer Mathematik und Datenverarbeitung mbH (GMD)
>Schloss Birlinghoven, P.O.Box 1240, D-5205 Sankt Augustin 1, FRGermany

Das Programm ist nur so gut wie die Muster, die es besitzt. Also, man kann sehr leicht mit Woertern wie {Buer, Naefe, ..., Kontoerfassung} umgehen. Gustaf Neumann hat einen sehr umfangreichen Wortschatz fuer `diac' entwickelt, und ich moechte nicht vieles darueber sagen, weil Gustaf selbst bald darueber schreiben wird, wenn er mal ganz zufrieden mit seinem Programm geworden ist.

Das Hauptproblem besteht nur darin, wenn ein Ascii-Wort zwei moegliche `Uebersetzungen' hat, beide sinnvoll. Z.B. Masse/Ma{\ss}e sowie M\"uller/Mueller. Ich gebe zu, dass ich keine richtige Loesung fuer diese Ambiguitaet weiss. Man kann auf jeden Fall `Mu{}eller' (sehr haesslich :-[) schreiben, wenn die Buchstabierung mit `u-e' (nicht `\"u') gewuenscht wird.

Man kann auch ein weiteres `Fenster' fuer die Ascii-Zeichenketten benutzen (also, man beobachtet ganze Phrasen statt nur Woerter). Mit dieser Methode kann man sofort (ok, ok, es gibt noch einige Schwierigkeiten) erkennen, dass `in hohem Masse' nur als `in hohem Ma{\ss}e' zu uebersetzen ist.

Der Herr Mueller aber bleibt ein unsympathischer Mensch.

--d

(for comp.text.tex)

The program is only as good as the patterns it has. Thus, one can easily include the appropriate rules for {Buer, Naefe, ..., Kontoerfassung}. Gustaf Neumann has developed a very comprehensive set of patterns, and I don't want to say too much about it, since Gustaf will talk about it himself before long, as soon as he's satisfied with his creation.

The chief problem, though, is when an Ascii word has two possible transliterations, both of them meaningful. E.g., Masse/Ma{\ss}e and M\"uller/Mueller. I cheerfully admit I haven't found a neat solution for this ambiguity. Of course, one could get around this by writing `Mu{}eller' (eek!) when one wants the spelling with `u-e' (not `\"u').

One could also use a wider `window' when processing the Ascii text (i.e., one observes phrases rather than just words). With this technique, one can immediately recognize (ok, ok, there are still some problems) that `in hohem Masse' is only to be transliterated as `in hohem Ma{\ss}e'.

Mr Mueller, however, still remains a spoilsport.

--d
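
The wider-window idea and the `Mu{}eller' workaround can both be sketched in a few lines of Python. This is a hedged illustration, not the mechanism in `diac': the `PHRASES` table and the `convert` function are hypothetical, and only the ue rule is included to keep the sketch short.

```python
import re

# Hypothetical phrase table: contexts in which an otherwise ambiguous
# word has only one sensible reading.
PHRASES = {"in hohem Masse": "in hohem Ma{\\ss}e"}

# "ue" is treated as an umlaut unless it follows a, e, or q (assumed rule).
UE = re.compile(r"(?<![aeqAEQ])ue")


def convert(text):
    """Apply phrase-level rules first, then the word-level ue rule."""
    for ascii_phrase, tex_phrase in PHRASES.items():
        text = text.replace(ascii_phrase, tex_phrase)
    return UE.sub(lambda m: '\\"u', text)


print(convert("in hohem Masse"))   # in hohem Ma{\ss}e
print(convert("die Masse"))        # die Masse
print(convert("Herr Mueller"))     # Herr M\"uller
print(convert("Herr Mu{}eller"))   # Herr Mu{}eller
```

The `{}` escape needs no special handling in such a scheme: the empty group separates u from e, so no pattern matches, and TeX typesets the result as plain Mueller.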