ok@cs.mu.oz.au (Richard O'Keefe) (10/13/89)
Consider the problems of someone trying to write Prolog code which handles
words in a language other than English. There are at least four contexts
where the code may be used:
- a national variant of ISO 646
- ISO 8859/1 (or DEC MNCS, which is very close)
- the MS-DOS character set
- the Macintosh character set

{I omit all discussion of ISO 8859/N for N > 1 and alternative character
sets on the Macintosh or OS/2; not because I don't know about these things
but because this is hard enough already.}

Quintus Prolog already lets you write the magic number you need, using
C-style escapes. For example, the Old English word which became "whether"
is spelled h,w,ae,eth,e,r, which we could write as 'hw\xE6\xF0er'. The
problems with this are (a) it is very hard to tell which letters are
intended by looking at hex; (b) if the same program is moved to another
system which uses a different coding, the numbers stay put, which means
that you get different characters; (c) the other system may not have any
coding for these characters at all, but you aren't warned.

Why not write the characters you want directly? (a) It may not be
possible: some editors on the PC do not give you direct access to the
upper 128 characters. (b) It is even less portable than writing escape
codes.

If Prolog is to be used for writing programs that can be *portable*
between these environments, it is important that we should have some way
of indicating which characters we mean, so that they may be mapped
correctly and a warning may be given when they cannot be mapped. The best
scheme I have been able to come up with uses escape sequences like

    \: <first letter> <second letter>
        for ligatures (ae, oe) and some others: lower and upper case
        thorn -> th, TH; ess-tset -> ss; and the copyright symbol
        is \:co

    \: <letter> <diacritical>
        <diacritical> is ` ^ ' " ~ , / . - for grave, circumflex,
        acute, umlaut, tilde, cedilla, slash, ring, macron (I should
        be so lucky)

    \: <other> <other>
        e.g. \:!! and \:?? for inverted ! and ?, \:<< and \:>> for
        Continental quotes.

"whether" would look like 'hw\:ae\:d/er' in this coding (I'm tempted to
code eth/Eth as dh/DH similarly to thorn). It is hard to read, but it is
still better than 'hw\xE6\xF0er', and it means that if we read the code
on a system which hasn't got ash or eth the tokeniser can print an error
message and substitute something `close' (eth and thorn -> t, ash -> e,
\:<letter><diacritical> drops the diacritical mark, \:<other><other>
turns into <other>). Having '\:<<hw\:ae\:ther\:>>' converted to
'<hweter>' is a lot better than having it converted to garbage,
particularly if you get an error message when it happens.

I want to stress that I don't regard this as anything other than a
practical compromise; it would be better if the MS-DOS and Mac character
sets would dry up and blow away so that everyone was using ISO 8859/*
from now on, but that just isn't going to happen, and I think we need a
better way of coping than we have now.

So what's the question? The question is whether diacritical marks should
precede or follow the letter they modify. I prefer \:e' because I read it
as "e-acute" and so expect the diacritical mark second. But I believe
there is a French convention that involves writing the diacritical mark
first. There's also a question about whether the characters I picked for
the diacritical marks are OK.

I was hoping that the BSI committee might be relied on to do something
about this problem (it is, after all, a syntax problem), but (a) they
haven't, and (b) one of the latest documents I have claims that escape
sequences aren't needed inside atoms anyway, so I think we have to do it
ourselves, and do it soon. If anyone can come up with a better
suggestion, please do. But remember that it has to cover all the letters
in the ISO 8859/1, MS-DOS and Mac character sets, and should be a wee bit
open-ended in case we've missed something.
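[To make the proposed fallback behaviour concrete, here is a small sketch
(in Python rather than Prolog, purely for illustration) of how a tokeniser
might expand these \: escapes into ISO 8859/1 and substitute something
`close' with a warning when the target coding lacks a character. The
escape pairs and fallback substitutions are the examples from the post;
the function names and the shape of the code are invented for this
sketch.]

```python
# Sketch (not Quintus code) of the proposed \: escape decoding with
# graceful degradation.  Escape pairs and fallbacks follow the post;
# everything else is illustrative.

# Escape pair -> ISO 8859/1 character, for the examples in the post.
ESCAPES = {
    "ae": "\u00e6",   # ash (ae ligature)
    "d/": "\u00f0",   # eth
    "th": "\u00fe",   # thorn
    "e'": "\u00e9",   # e-acute
    "co": "\u00a9",   # copyright sign
    "!!": "\u00a1",   # inverted !
    "??": "\u00bf",   # inverted ?
    "<<": "\u00ab",   # left Continental quote
    ">>": "\u00bb",   # right Continental quote
}

# Something `close' to substitute when the target set lacks the character.
FALLBACK = {
    "ae": "e", "d/": "t", "th": "t", "e'": "e", "co": "(c)",
    "!!": "!", "??": "?", "<<": "<", ">>": ">",
}

def decode(atom, available=None):
    """Expand \\: escapes; substitute and warn when a character
    has no coding in the available set."""
    if available is None:
        available = set(ESCAPES)        # full Latin-1 repertoire
    out, i = [], 0
    while i < len(atom):
        if atom.startswith("\\:", i):
            pair = atom[i + 2:i + 4]
            if pair in available:
                out.append(ESCAPES[pair])
            else:
                print("warning: no coding for \\:%s, using %r"
                      % (pair, FALLBACK[pair]))
                out.append(FALLBACK[pair])
            i += 4
        else:
            out.append(atom[i])
            i += 1
    return "".join(out)

print(decode("hw\\:ae\\:d/er"))                    # -> hwæðer
print(decode("hw\\:ae\\:d/er", available={"ae"}))  # -> hwæter, plus warning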
alberto@tove.umd.edu (Jose Alberto Fernandez R) (10/15/89)
> The question is whether diacritical marks should precede or follow the
> letter they modify. I prefer \:e' because I read it as "e-acute" and so
> expect the diacritical mark second. But I believe there is a French
> convention that involves writing the diacritical mark first.

Well, historically on typewriters the marks are typed first and then the
character; this is for mechanical reasons (the mark does not move the
paper, and the paper moves when the marked letter is typed).

On the other hand, TeX and LaTeX have defined a code for these marks, and
at least for the folks that use [La]TeX it will be nice if you only need
to learn one convention.

Jose Alberto.
--
:/   \  Jose Alberto Fernandez R | INTERNET: alberto@cs.umd.edu
:| o o | Dept. of Computer Sc.   |
:|  ^  | University of Maryland  |
:\ \_/ / College Park, MD 20742  |
ok@cs.mu.oz.au (Richard O'Keefe) (10/15/89)
I wrote:
: The question is whether diactrical marks should precede or follow the
: letter they modify. I prefer \:e' because I read it as "e-acute" and so
: expect the diactrical mark second. But I believe there is a French
: convention that involves writing the diactrical mark first.
In article <ALBERTO.89Oct14145937@tove.umd.edu>,
alberto@tove.umd.edu (Jose Alberto Fernandez R) wrote:
: Well, historically on typewriters the marks are typed first and then
: the character this is for mechanical reasons (the mark does not move
: the page and the paper moves when the marked letter is typed).
I still have a typewriter with non-advancing keys.
: On the other hand, TeX and LaTeX have defined a code for these marks
: and at least for the folks that use [La]TeX it will be nice if you only
: need to learn one convention.
For the benefit of those without [La]TeX, here is the LaTeX scheme:
    Code        Meaning                 Present use in Prolog
    \`x         grave accent            (available)
    \'x         acute accent            a quote (') followed by x
    \^x         circumflex              control-X
    \"x         umlaut/diaeresis        a quote (") followed by x
    \~x         tilde                   (available)
    \=x         macron (overbar)        (available)
    \.x         dot                     (available)
    \ux         breve                   (reserved)
    \vx         "v" accent              (reserved)
    \Hx         two acutes              (reserved)
    \txy        "tie" over x and y      tab followed by x and y
    \cx         cedilla                 continuation followed by x
    \dx         dot underneath          DEL followed by x
    \bx         underbar                backspace followed by x
    \oe         oe ligature             (reserved)
    \ae         ae ligature ("ash")     BEL followed by e
    \aa         a ring                  BEL followed by a
    \o          slashed o               (reserved)
    \l          slashed l               (reserved)
    \ss         ess-tset                space followed by s
    \pounds     pound sterling sign     (reserved)
    \copyright  copyright sign          continuation followed by opyright
    \S          section sign            space
    \P          pilcrow                 (reserved)
    ?`          upside-down ?           ? followed by `
    !`          upside-down !           ! followed by `
We cannot use this scheme, because too many of the sequences are already
in use. The ?` and !` ligatures in TeX would be particularly painful to
add to Prolog. TeX permits the construction of accented characters which
have no counterpart in ISO 8859/1, the MS-DOS character set, or the Mac
character set. That's fine, no problem. The thing which *really* makes
it unacceptable is that it has no way of expressing some of the
characters which ARE in the ISO 8859/1 character set, such as eth and
thorn, guillemets, the Yen sign, ...
On the other hand, with the example of mechanical typewriters, [La]TeX,
and a French scheme I've seen, it does appear that putting the accents
first would be more consistent with "existing practice". Too bad.
Using \:'e for e-acute would not be so much unlike \'e that a TeXnician
would be confused, I hope.
ted@nmsu.edu (Ted Dunning) (10/16/89)
In article <2432@munnari.oz.au> ok@cs.mu.oz.au (Richard O'Keefe) writes:
> We cannot use this scheme, because too many of the sequences are
> already in use. The ?` and !` ligatures in TeX would be
> particularly painful to add to Prolog.
prolog's difficulty in dealing with the european character sets is
nothing compared with the genuine antipathy with which it regards
oriental character sets. for instance, in quintus, put strips the
high bit of characters being output, and the contents of string
literals are stripped of their high bits by the guts of read. this
leads to real pain in trying to write a program which has embedded
chinese or japanese characters in it.
of course, the real fix is not just to put in a hack which avoids all
this gratuitous bit stripping. what should be done is to start
supporting characters as a data type distinct from integers, both for
tiny character sets such as those used by the european languages, and
for larger character sets such as chinese, japanese, korean, and the
indic languages.
> The thing which *really* makes it unacceptable is that it has no
> way of expressing some of the characters which ARE in the ISO
> 8859/1 character set, such as eth and thorn, guillemets, Yen sign,
> ...
that is only the beginning. why admit there is such a thing as a yen
if you won't admit that kanji exists?
--
ted@nmsu.edu
Dem Dichter war so wohl daheime
In Schildas teurem Eichenhain!
Dort wob ich meine zarten Reime
Aus Veilchenduft und Mondenschein
ok@cs.mu.oz.au (Richard O'Keefe) (10/16/89)
In article <TED.89Oct15122042@kythera.nmsu.edu>, ted@nmsu.edu
(Ted Dunning) writes:
> prolog's difficulty in dealing with the european character sets is
> nothing compared with the genuine antipathy with which it regards
> oriental character sets. for instance, in quintus, put strips the
> high bit of characters being output, and the contents of string
> literals are stripped of their high bits by the guts of read. this
> leads to real pain in trying to write a program which has embedded
> chinese or japanese characters in it.

Yes, there were old versions of Quintus Prolog which did this, but it
hasn't been true for a long time. What's more, Quintus Prolog supports
Kanji under VAX/VMS and VAX/Ultrix to my certain knowledge, and may
(I've been away from Quintus for a while) support Kanji on other
platforms as well. Certainly it has no difficulty with the "Shift-JIS"
coding; the main problem is that the coming thing is EUC, but Quintus
intend to support EUC as well as Shift-JIS.

> that is only the beginning. why admit there is such a thing as a yen
> if you won't admit that kanji exists?

Well, I'm not speaking for Quintus, I'm just speaking for myself, and I
insisted back in 1984 that the Prolog standard should support Kanji.
Quintus does admit that Kanji exists and has supported it for years.
(You do have to buy a special version. Send mail to sales@quintus.com.)
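[The Shift-JIS versus EUC point is easy to see in byte values: the same
kanji text has different byte sequences under the two codings, so a
tokeniser must be told which one its input uses. A quick illustration,
in Python purely for brevity; the byte values themselves are the
standard ones.]

```python
# The same text encodes to different byte sequences under Shift-JIS
# and EUC, so the reader must know which coding is in use.
word = "\u6f22\u5b57"                 # the two kanji read "kanji"
print(word.encode("shift_jis"))       # b'\x8a\xbf\x8e\x9a'
print(word.encode("euc_jp"))          # b'\xb4\xc1\xbb\xfa'

# In EUC every byte of a double-byte character has its high bit set,
# so naive byte-oriented code survives.  Shift-JIS second bytes can be
# plain ASCII -- katakana SO ends in 0x5C, the backslash -- which is
# exactly the sort of byte a tokeniser must not misread as an escape.
print("\u30bd".encode("shift_jis"))   # second byte is 0x5C
```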
cdsm@sappho.doc.ic.ac.uk (Chris Moss) (10/17/89)
Richard O'Keefe writes:
> Consider the problems of someone trying to write Prolog code
> which handles words in a language other than English.

Your message prompted me to look at the latest Japanese proposal that
was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting of
the ISO Prolog standardization committee. (Richard, they sent out your
comments on I/O in the same mailing.) It's called "Multi-octet character
sets in Prolog" by Makoto Negishi, Yoshitomi Marisawa, Morihiko Tajima
and Katsuhiko Nakamura, dated Sep. 1989. I will try to summarise the
proposals, and add my comments, indented.

1. It adds an "extended identifier indicator char" to the definition of
"identifier token" which is "implementation defined". "For example it
may include small letter a with grave accent, small letter a with acute
accent, etc. and Japanese characters". It similarly adds an "extended
variable indicator char" for starting variables.

   i.e. _any_ characters can be added for atoms within a strict
   definition of the standard. This would seem to make portability of
   programs across national boundaries rather nightmarish.

2. Collating sequence. It suggests the standard should only define an
alphabetical ordering within three groups of characters: small letters,
capital letters and digits. Anything else is based on an extended
collating sequence which is implementation defined.

   This thus seems to throw away even the rather ill-defined "subset of
   ISO 8859" which is referred to in 7.5 of the N40 document. Presumably
   any character set, even EBCDIC, would qualify.

3. Character equivalence. They define a bip called
"set_equivalence_char" which maps extended characters into the base
character set. A call to this predicate sets up a dynamic equivalence.

   I assume this is basically for input routines: if one gets a
   multi-octet character which is also in the basic character set
   (8859?) then it is automatically converted. They suggest it can also
   be used for italic characters etc., and this wouldn't be symmetrical
   on output.

They don't address the way in which strings represent multi-octet
characters except by example; they refer to N32 and N34, which I don't
appear to have received (the numbers refer to the ISO numbering for
standardization documents). Examples are " $@!N (J" and " $@#A (J".
They mostly assume the use of the Japanese standard JIS X 0208.

-----------------
Comment: As far as I can see, these totally miss solving any of the
problems! How can one scan a program if one doesn't know what characters
are used in atoms, variables etc.? One needs some type of declarations
to tell the processor what to expect. I don't know why the
representation of octets in strings is so strange; maybe someone can
enlighten me. But it doesn't solve any of Richard's problems. I could
post the document to the net, though it appears to be missing some
figures.

So much for now!
Chris Moss
cdsm@doc.ic.ac.uk
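[Reading off Chris's summary, the dynamic equivalence seems to amount to
a mutable table applied on the input path. A minimal sketch of that
reading, in Python; the predicate name comes from the proposal, but the
semantics shown here are only my guess at what the summary describes.]

```python
# A guess at the behaviour of the proposal's "set_equivalence_char", as
# summarised above: a dynamic table mapping extended characters onto
# base-set characters, applied automatically when input is read.
equivalences = {}

def set_equivalence_char(extended, base):
    """Set up a dynamic equivalence (the proposal's bip, as I read it)."""
    equivalences[extended] = base

def read_char(ch):
    """Input path: convert a character if an equivalence is in force."""
    return equivalences.get(ch, ch)

# E.g. map fullwidth 'A' (a multi-octet character which also exists in
# the basic set) onto plain 'A':
set_equivalence_char("\uff21", "A")
print(read_char("\uff21"))   # -> A
print(read_char("B"))        # -> B (unaffected)
```

Note that nothing here runs on output, matching the remark that the
mapping "wouldn't be symmetrical on output".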
ok@cs.mu.oz.au (Richard O'Keefe) (10/17/89)
In article <1067@gould.doc.ic.ac.uk>, cdsm@sappho.doc.ic.ac.uk
(Chris Moss) writes:
> Richard O'Keefe writes:
> >Consider the problems of someone trying to write Prolog code
> >which handles words in a language other than English.
> Your message prompted me to look at the latest Japanese proposal that
> was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting
> of the ISO Prolog standardization committee.
> (Richard, they sent out your comments on I/O in the same mailing)

If the "comments on I/O" means the note I wrote to Roger Scowen pointing
out that "current input and current output are always valid streams, no
matter how files are closed" is an important invariant whose
preservation ought to be explicitly demanded by the standard, that was
PRIVATE MAIL not intended for publication, and distributed without my
knowledge or permission. I have already taken a lot of flack from
Quintus because they thought I was attacking LPA (which I wasn't, quite
the opposite). Thanks to Chris Moss for posting his comments.

I really don't see what is supposed to be so hard about Kanji. Quintus
Prolog supported Kanji on the Xerox Lisp machines (well, it still does
if anyone is supporting the hardware...) and supports Kanji under
Vax/VMS and Vax/Ultrix, and may do so on other systems by now. When
Quintus did that, the C standard hadn't tackled multi-octet (why OCTet?
why can't I have an 18-bit character set?) characters. Now that "wide"
characters ARE tackled in the C standard (wchar_t and friends), it is
extremely important that whatever is decided for Prolog should not be
too different from C (for the simple reason that Prolog and C programs
will have to read each other's files). I suggest that the BSI/ISO
committee should extract the relevant parts of the current ANSI C draft
(with ANSI's permission, of course) and mail the extracts to the Prolog
standard mailing list.
The problem of dealing with a SINGLE character set (whether it be 7-bit,
8-bit, or 16-bit) is fairly straightforward. The problem I am concerned
with is porting source code for any one Western European language
between the three incompatible 8-bit character sets we already have.

> 2. Collating sequence. It suggests the standard should only define an
> alphabetical ordering within three groups of characters - small
> letters, capital letters and digits. Anything else is based on an
> extended collating sequence which is implementation defined.

This is silly. Different European languages collate the same symbols
differently. (Think about the Spanish rule for "ll".) If you want
locale-dependent collating, you are talking about a relation between
character SEQUENCES, not single characters. Since 1987 at the latest I
have been saying that the Prolog standard ought to have two separate
comparison predicates:

    compare(R, X, Y) -- as at present, where the relative order of two
        texts of the same type is the same as the relative order of the
        lists of integers representing their names

    collate(R, X, Y) -- locale-dependent ordering; the relative order
        of texts is not necessarily reducible to an ordering on
        characters; should sort lower and upper case together, e.g.
        stra\:sse and STRASSE should be similar. (Yes, one of those
        words has 6 characters and the other 7, but they differ only
        in case...)

The distinction is of great practical importance: to obtain fast Prolog
programs in a wide range of applications we *MUST* have ***FAST***
comparison. collate/3 is likely to be slow, so setof/3 should use the
fast comparison.

[My postings to this group on this topic may be reproduced by anyone for
any purpose.]
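[The two orderings can be sketched side by side. Python is used here
only for brevity; compare/3 and collate/3 are the proposed Prolog
predicates, and the casefold-then-tiebreak key below is merely a crude
stand-in for a real locale-dependent collation.]

```python
# Sketch of the distinction: "compare" as raw code order (fast -- what
# sorting and setof/3 should use), "collate" as a locale-style order
# that sorts lower and upper case together.
def compare(x, y):
    """Standard order: -1, 0, or 1 by raw code order."""
    return (x > y) - (x < y)

def collate_key(x):
    """Crude locale-ish key: case-insensitive first, raw text as
    a tiebreaker only."""
    return (x.casefold(), x)

words = ["Zebra", "apple", "Apple"]
print(sorted(words))                   # ['Apple', 'Zebra', 'apple']
print(sorted(words, key=collate_key))  # ['Apple', 'apple', 'Zebra']
print(collate_key("Stra\u00dfe"))      # casefold even turns ß into ss
```

The raw order puts every capital before every small letter; the collated
order interleaves them, which is what a user expects from a dictionary.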
alberto@tove.umd.edu (Jose Alberto Fernandez R) (10/17/89)
> We cannot use this scheme, because too many of the sequences are
> already in use. The ?` and !` ligatures in TeX would be particularly
> painful to add to Prolog.

Well, we don't need to agree on all the sequences, but if most of them
can be similar, that is at least something.

> The thing which *really* makes it unacceptable is that it has no way
> of expressing some of the characters which ARE in the ISO 8859/1
> character set, such as eth and thorn, guillemets, Yen sign, ...

We can define our own in this case.

> Using \:'e for e-acute would not be so much unlike \'e that a
> TeXnician would be confused, I hope.

That's my whole idea! The sequences do not need to be exactly the same,
but close enough that people can remember them easily. By the way, the
idea of representing the inverted ? by \:?? is pretty good.

Jose Alberto.
--
:/   \  Jose Alberto Fernandez R | INTERNET: alberto@cs.umd.edu
:| o o | Dept. of Computer Sc.   |
:|  ^  | University of Maryland  |
:\ \_/ / College Park, MD 20742  |