[comp.lang.prolog] Non-ASCII characters, suggestion and question

ok@cs.mu.oz.au (Richard O'Keefe) (10/13/89)

Consider the problems of someone trying to write Prolog code
which handles words in a language other than English.  There
are at least four contexts where the code may be used:

    -	a national variant of ISO 646
    -	ISO 8859/1 (or DEC MNCS, which is very close)
    -	MS-DOS character set
    -	Macintosh character set

{I omit all discussion of ISO 8859/N for N > 1 and alternative
character sets on the Macintosh or OS/2; not because I don't know
about these things but because this is hard enough already.}

Quintus Prolog already lets you write the magic number you need,
using C-style escapes.  For example, the Old English word which
became "whether" is spelled h,w,ae,eth,e,r, which we could write
as 'hw\xE6\xF0er'.  The problems with this are

(a) it is very hard to tell which letters are intended by looking at hex
(b) if the same program is moved to another system which uses a different
    coding, the numbers stay put, which means that you get different
    characters
(c) the other system may not have any coding for these characters at
    all, but you aren't warned.

Why not write the characters you want directly?
(a) It may not be possible.  Some editors on the PC do not give you
    direct access to the upper 128 characters.
(b) It is even less portable than writing escape codes.

If Prolog is to be used for writing programs that can be *portable*
between these environments, it is important that we should have some
way of indicating which characters we mean, so that they may be mapped
correctly and a warning may be given when they cannot be mapped.

The best scheme I have been able to come up with uses escape sequences
like
	\: <first letter> <second letter>
		for ligatures (ae, oe) and some others: lower and
		upper case thorn -> th, TH, ess-tset -> ss; and the
		copyright symbol is \:co
		
	\: <letter> <diacritical>
		<diacritical> is ` ^ ' " ~ , / . -
		for grave, circumflex, acute, umlaut, tilde,
		cedilla, slash, ring, macron (I should be so lucky)

	\: <other> <other>
		E.g. \:!! and \:?? for inverted ! and ?, \:<< and \:>>
		for Continental quotes.


"whether" would look like 'hw\:ae\:d/er' in this coding (I'm tempted to
code eth/Eth as dh/DH similar to thorn).  It is hard to read, but it is
still better than 'hw\xE6\xF0er', and it means that if we read the code
in a system which hasn't got ash or eth the tokeniser can print an
error message and substitute something `close' (eth and thorn -> t,
ash -> e, \:<letter><diacritical> drops the diacritical mark,
\:<other><other> turns into <other>).  Having '\:<<hw\:ae\:ther\:>>'
converted to '<hweter>' is a lot better than having it converted to garbage,
particularly if you get an error message when it happens.
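To make the intended behaviour concrete, here is a small sketch (in Python, for brevity) of how a tokeniser might expand these escapes and fall back when the local coding lacks a character. The escape table, the fallback table, and the function name are illustrative choices, not part of the proposal.

```python
# Illustrative sketch of the proposed \: escapes, mapped to ISO 8859/1.
# The table entries and fallback choices are hypothetical examples.

ESCAPES = {          # two characters following "\:" -> ISO 8859/1 code
    "ae": 0xE6,      # ash (ae ligature)
    "d/": 0xF0,      # eth
    "th": 0xFE,      # thorn
    "e'": 0xE9,      # e-acute
    "co": 0xA9,      # copyright sign
    "??": 0xBF,      # inverted question mark
    "<<": 0xAB,      # left guillemet
}

# Fallback substitutions for systems lacking a character, as suggested
# in the text: eth and thorn -> t, ash -> e, drop the diacritical, etc.
FALLBACK = {"ae": "e", "d/": "t", "th": "t", "e'": "e", "??": "?", "<<": "<"}

def decode(atom, have_latin1=True):
    """Expand \\: escapes; substitute something `close' when the local
    coding has no such character (a real tokeniser would also warn)."""
    out, i = [], 0
    while i < len(atom):
        if atom.startswith("\\:", i):
            key = atom[i + 2 : i + 4]
            if have_latin1 and key in ESCAPES:
                out.append(chr(ESCAPES[key]))
            else:
                out.append(FALLBACK.get(key, key[:1]))
            i += 4
        else:
            out.append(atom[i])
            i += 1
    return "".join(out)
```

On a Latin-1 system decode("hw\\:ae\\:d/er") yields the six Old English letters directly; without Latin-1 the same atom degrades to "hweter", matching the substitution described above.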

I want to stress that I don't regard this as anything other than a
practical compromise; it would be better if the MS-DOS and Mac character
sets would dry up and blow away so that everyone was using ISO 8859/*
from now on, but that just isn't going to happen, and I think we need a
better way of coping than we have now.

So what's the question?

The question is whether diacritical marks should precede or follow the
letter they modify.  I prefer \:e' because I read it as "e-acute" and so
expect the diacritical mark second.  But I believe there is a French
convention that involves writing the diacritical mark first.

There's also a question about whether the characters I picked for the
diacritical marks are ok.

I was hoping that the BSI committee might be relied on to do something
about this problem (it is, after all, a syntax problem), but (a) they
haven't and (b) one of the latest documents I have claims that escape
sequences aren't needed inside atoms anyway, so I think we have to do it
ourselves, and do it soon.

If anyone can come up with a better suggestion, please do.  But remember
that it has to cover all the letters in the ISO 8859/1, MS-DOS and Mac
character sets, and should be a wee bit open-ended in case we've missed
something.

alberto@tove.umd.edu (Jose Alberto Fernandez R) (10/15/89)

 The question is whether diacritical marks should precede or follow the
 letter they modify.  I prefer \:e' because I read it as "e-acute" and so
 expect the diacritical mark second.  But I believe there is a French
 convention that involves writing the diacritical mark first.

Well, historically on typewriters the marks are typed first and then
the character; this is for mechanical reasons (the dead key for the mark
does not advance the carriage, and the paper moves only when the marked
letter is typed).

On the other hand, TeX and LaTeX have defined a code for these marks,
and at least for the folks that use [La]TeX it will be nice if you only
need to learn one convention.

	Jose Alberto.

--
:/       \ Jose Alberto Fernandez R | INTERNET: alberto@cs.umd.edu
:| o   o | Dept. of Computer Sc.    | 
:|   ^   | University of Maryland   | 
:\  \_/  / College Park, MD 20742   | 

ok@cs.mu.oz.au (Richard O'Keefe) (10/15/89)

I wrote:
:  The question is whether diacritical marks should precede or follow the
:  letter they modify.  I prefer \:e' because I read it as "e-acute" and so
:  expect the diacritical mark second.  But I believe there is a French
:  convention that involves writing the diacritical mark first.

In article <ALBERTO.89Oct14145937@tove.umd.edu>,
alberto@tove.umd.edu (Jose Alberto Fernandez R) wrote:
: Well, historically on typewriters the marks are typed first and then
: the character this is for mechanical reasons (the mark does not move
: the page and the paper moves when the marked letter is typed).

I still have a typewriter with non-advancing keys.

: On the other hand, TeX and LaTeX have defined a code for these marks
: and at least for the folks that use [La]TeX it will be nice if you only
: need to learn one convention.

For the benefit of those without [La]TeX, here is the LaTeX scheme:

	Code	Meaning			Present use in Prolog
	\`x	grave accent		(available)
	\'x	acute accent		a quote (') followed by x
	\^x	circumflex		control-X
	\"x	umlaut/diaeresis	a quote (") followed by x
	\~x	tilde			(available)
	\=x	macron (overbar)	(available)
	\.x	dot			(available)
	\ux	breve			(reserved)
	\vx	"v" accent		(reserved)
	\Hx	two acutes		(reserved)
	\txy	"tie" over x and y	tab followed by x and y
	\cx	cedilla			continuation followed by x
	\dx	dot underneath		DEL followed by x
	\bx	underbar		backspace followed by x
	\oe	oe ligature		(reserved)
	\ae	ae ligature ("ash")	BEL followed by e
	\aa	a ring			BEL followed by a
	\o	slashed o		(reserved)
	\l	slashed l		(reserved)
	\ss	ess-tset		space followed by s	
	\pounds	pound sterling sign	(reserved)
	\copyright  copyright sign	continuation followed by opyright
	\S	section sign		space
	\P	pilcrow			(reserved)
	?`	upside-down ?		? followed by `
	!`	upside-down !		! followed by `

We cannot use this scheme, because too many of the sequences are already
in use.  The ?` and !` ligatures in TeX would be particularly painful to
add to Prolog.  TeX permits the construction of accented characters which
have no counterpart in ISO 8859/1, the MS-DOS character set, or the Mac
character set.  That's fine, no problem.  The thing which *really* makes
it unacceptable is that it has no way of expressing some of the
characters which ARE in the ISO 8859/1 character set, such as eth and
thorn, guillemets, Yen sign, ...

On the other hand, with the example of mechanical typewriters, [La]TeX,
and a French scheme I've seen, it does appear that putting the accents
first would be more consistent with "existing practice".  Too bad.

Using \:'e for e-acute would not be so much unlike \'e that a TeXnician
would be confused, I hope.

ted@nmsu.edu (Ted Dunning) (10/16/89)

In article <2432@munnari.oz.au> ok@cs.mu.oz.au (Richard O'Keefe) writes:


   We cannot use this scheme, because too many of the sequences are
   already in use.  The ?` and !` ligatures in TeX would be
   particularly painful to add to Prolog.

prolog's difficulty in dealing with the european character sets is
nothing compared with the genuine antipathy with which it regards
oriental character sets.  for instance, in quintus, put strips the
high bit of characters being output, and the contents of string
literals are stripped of their high bits by the guts of read.  this
leads to real pain in trying to write a program which has embedded
chinese or japanese characters in it.

of course, the real fix is not to just put in a hack which avoids all
this gratuitous bit stripping.  what should be done is to start
supporting characters as a data type distinct from integers, both from
tiny character sets such as used by the european languages, and from
larger character sets such as chinese, japanese, korean, and the indic
languages.

   The thing which *really* makes it unacceptable is that it has no
   way of expressing some of the characters which ARE in the ISO
   8859/1 character set, such as eth and thorn, guillemets, Yen sign,
   ...

that is only the beginning.  why admit there is such a thing as a yen
if you won't admit that kanji exists?



--
ted@nmsu.edu
			The poet felt so much at home
			In Schilda's dear oak grove!
			There I wove my tender rhymes
			From violet scent and moonlight

ok@cs.mu.oz.au (Richard O'Keefe) (10/16/89)

In article <TED.89Oct15122042@kythera.nmsu.edu>, ted@nmsu.edu (Ted Dunning) writes:
> Prolog's difficulty in dealing with the european character sets is
> nothing compared with the genuine antipathy with which it regards
> oriental character sets.  for instance, in Quintus, put strips the
> high bit of characters being output, and the contents of string
> literals are stripped of their high bits by the guts of read.  this
> leads to real pain in trying to write a program which has embedded
> chinese or japanese characters in it.

Yes, there were old versions of Quintus Prolog which did this,
but it hasn't been true for a long time.  What's more, Quintus
Prolog supports Kanji under VAX/VMS and VAX/Ultrix to my
certain knowledge and may (I've been away from Quintus for a
while) support Kanji on other platforms as well.  It certainly
has no difficulty with the "Shift-JIS" coding; the main problem
is that the coming thing is EUC.  But Quintus intend to support
EUC as well as Shift-JIS.

> that is only the beginning.  why admit there is such a thing as a yen
> if you won't admit that kanji exists?

Well, I'm not speaking for Quintus, I'm just speaking for myself, and
I insisted back in 1984 that the Prolog standard should support Kanji.
Quintus does admit that Kanji exists and has supported it for years.
(You do have to buy a special version.  Send mail to sales@quintus.com.)

cdsm@sappho.doc.ic.ac.uk (Chris Moss) (10/17/89)

Richard O'Keefe writes:
>Consider the problems of someone trying to write Prolog code
>which handles words in a language other than English.  

Your message prompted me to look at the latest Japanese proposal that
was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting
of the ISO Prolog standardization committee.
(Richard, they sent out your comments on I/O in the same mailing)
It's called "Multi-octet character sets in Prolog" by Makoto Negishi,
Yoshitomi Marisawa, Morihiko Tajima and Katsuhiko Nakamura, dated Sep. 1989.

I will try and summarise the proposals, and add my comments, indented.

1. It adds an "extended identifier indicator char" to the definition of
"identifier token" which is "implementation defined". "For example it may
include small letter a with grave accent, small letter a with acute accent,
etc. and Japanese characters". It similarly adds an "extended variable
indicator char" for starting variables.

    i.e. _any_ characters can be added for atoms within a strict definition
    of the standard. This would seem to make portability of programs across
    national boundaries rather nightmarish.

2. Collating sequence. It suggests the standard should only define an
alphabetical ordering within three groups of characters - small letters,
capital letters and digits. Anything else is based on an extended
collating sequence which is implementation defined.

    This thus seems to throw away even the rather ill-defined "subset of
    ISO 8859" which is referred to in 7.5 of the N40 document. Presumably
    any character set, even EBCDIC, would qualify.

3. Character equivalence. They define a bip called "set_equivalence_char"
which maps (equivalences) extended characters onto the base character
set. A call to this predicate sets up a dynamic equivalence.

    I assume this is basically for input routines - if one gets a multi-octet
    character which is also in the basic character set (8859?) then it
    is automatically converted. They suggest it can also be used for italic
    characters etc., and this wouldn't be symmetrical on output.

They don't address the way in which strings represent multi-octet characters
except by example - they refer to N32 and N34 which I don't appear to have
received  (the numbers refer to the ISO numbering for standardization
documents).  Examples are " $@!N (J" and " $@#A (J".  They mostly assume the
use of the Japanese standard JIS X 0208.
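A guess at why those string examples look so strange: they appear to be ISO 2022 code-extension sequences whose ESC (0x1B) bytes were stripped somewhere in transit -- ESC $ @ designates the JIS X 0208 (1978) two-byte set and ESC ( J designates JIS X 0201 Roman. A sketch with a codec that implements the same mechanism (Python's iso2022_jp, which happens to emit the 1983 designation ESC $ B and return to ASCII with ESC ( B) shows the bracketing:

```python
# The "strange" examples are consistent with ISO 2022 escape sequences
# whose ESC (0x1B) bytes were lost in the mail:
#   ESC $ @  ... switch to the JIS X 0208 (1978) two-byte set
#   ESC ( J  ... switch back to JIS X 0201 Roman
# The iso2022_jp codec uses the 1983 designations, but the shape of the
# encoded string -- designate, two-byte codes, designate back -- is the same.

encoded = "\u65e5".encode("iso2022_jp")   # a single kanji character
assert encoded.startswith(b"\x1b$B")      # designate the two-byte set
assert encoded.endswith(b"\x1b(B")        # designate ASCII again
```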

    -----------------

Comment:
As far as I can see, these totally miss solving any of the problems!
How can one scan a program if one doesn't know what characters are used in
atoms, variables etc.? One needs some type of declarations to tell the
processor what to expect. I don't know why the representation of octets in
strings is so strange, maybe someone can enlighten me. But it doesn't solve any
of Richard's problems.

I could post the document to the net, tho it appears to be missing some
figures.

So much for now!
Chris Moss cdsm@doc.ic.ac.uk

ok@cs.mu.oz.au (Richard O'Keefe) (10/17/89)

In article <1067@gould.doc.ic.ac.uk>, cdsm@sappho.doc.ic.ac.uk (Chris Moss) writes:
> Richard O'Keefe writes:
> >Consider the problems of someone trying to write Prolog code
> >which handles words in a language other than English.  

> Your message prompted me to look at the latest Japanese proposal that
> was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting
> of the ISO Prolog standardization committee.
> (Richard, they sent out your comments on I/O in the same mailing)

If the "comments on I/O" means the note I wrote to Roger Scowen
pointing out that "current input and current output are always
valid streams, no matter how files are closed" is an important
invariant whose preservation ought to be explicitly demanded by
the standard, that was PRIVATE MAIL not intended for publication,
and distributed without my knowledge or permission.
I have already taken a lot of flack from Quintus because they
thought I was attacking LPA (which I wasn't, quite the opposite).

Thanks to Chris Moss for posting his comments.

I really don't see what is supposed to be so hard about Kanji.
Quintus Prolog supported Kanji on the Xerox Lisp machines (well, it
still does if anyone is supporting the hardware...) and supports Kanji
under Vax/VMS and Vax/Ultrix, and may do so on other systems by now.

When Quintus did that, the C standard hadn't tackled multi-octet
(why OCTet? why can't I have an 18-bit character set?) characters.
Now that "wide" characters ARE tackled in the C standard (wchar_t
and friends), it is extremely important that whatever is decided
for Prolog should not be too different from C (for the simple reason
that Prolog and C programs will have to read each other's files).
I suggest that the BSI/ISO committee should extract the relevant
parts of the current ANSI C draft (with ANSI's permission, of course)
and mail the extracts to the Prolog standard mailing list.

The problem of dealing with a SINGLE character set (whether it be 7 bit,
8 bit, or 16 bit) is fairly straightforward.  The problem I am concerned
with is porting source code for any one Western European language between
the three incompatible 8-bit character sets we already have.

> 2. Collating sequence. It suggests the standard should only define an
> alphabetical ordering within three groups of characters - small letters,
> capital letters and digits. Anything else is based on an extended
> collating sequence which is implementation defined.

This is silly.  Different European languages collate the same symbols
differently.  (Think about the Spanish rule for "ll".)  If you want
locale-dependent collating, you are talking about a relation between
character SEQUENCES, not single characters.  Since 1987 at the latest
I have been saying that the Prolog standard ought to have two separate
comparison predicates:
	compare(R, X, Y)
		-- as at present, where the relative order of two texts
	 	   of the same type is the same as the relative order of
		   the lists of integers representing their names
	collate(R, X, Y)
		-- locale-dependent ordering, relative order of texts is
		   not necessarily reducible to an ordering on characters;
		   should sort lower and upper case together, e.g.
		   stra\:sse and STRASSE should be similar.  (Yes, one of
		   those words has 6 characters and the other 7, but they
		   differ only in case...)

The distinction is of great practical importance:  to obtain fast Prolog
programs in a wide range of applications we *MUST* have ***FAST***
comparison.  collate/3 is likely to be slow.  So setof/3 should use the
fast comparison.
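The distinction can be sketched as follows; the function names mirror the proposed predicates, and the case-folding rule is only a stand-in for a genuine locale-dependent collation table.

```python
# Sketch of the two orderings: compare/3-style ordering by raw character
# codes versus a collate/3-style locale ordering.  Case-folding here is
# an illustrative stand-in for a real locale table.

def compare(x, y):
    """Fast code-point comparison, returning -1, 0, or 1
    (the ordering setof/3 should use)."""
    return (x > y) - (x < y)

def collate(x, y):
    """Locale-flavoured comparison: folds case first
    (under Unicode case-folding, \xdf folds to ss)."""
    fx, fy = x.casefold(), y.casefold()
    return (fx > fy) - (fx < fy)

# "stra\xdfe" and "STRASSE" differ under the fast ordering ...
assert compare("stra\xdfe", "STRASSE") != 0
# ... but a case-folding collation sorts them together,
# even though one has 6 characters and the other 7.
assert collate("stra\xdfe", "STRASSE") == 0
```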

[My postings to this group on this topic may be reproduced by anyone for
 any purpose.]

alberto@tove.umd.edu (Jose Alberto Fernandez R) (10/17/89)

  We cannot use this scheme, because too many of the sequences are already
  in use.  The ?` and !` ligatures in TeX would be particularly painful to
  add to Prolog.  

Well, we don't need to agree on all the sequences, but if most of them
can be similar, that is at least something.

  The thing which *really* makes
  it unacceptable is that it has no way of expressing some of the
  characters which ARE in the ISO 8859/1 character set, such as eth and
  thorn, guillemots, Yen sign, ...

We can define our own in this case.

  Using \:'e for e-acute would not be so much unlike \'e that a TeXnician
  would be confused, I hope.

That's my whole idea! The sequences do not need to be exactly the
same, but close enough that people can remember them easily. By the way,
the idea to represent the inverted ? by \:?? is pretty good.

	Jose Alberto.

--
:/       \ Jose Alberto Fernandez R | INTERNET: alberto@cs.umd.edu
:| o   o | Dept. of Computer Sc.    | 
:|   ^   | University of Maryland   | 
:\  \_/  / College Park, MD 20742   |