[comp.lang.prolog] character type

ok@quintus.UUCP (Richard A. O'Keefe) (03/11/88)

In article <5337@utah-cs.UUCP>, shebs%defun.utah.edu.uucp@utah-cs.UUCP (Stanley T. Shebs) writes:
> In article <751@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
> > Comparing Pascal code I've seen
> > with C suggests that using ORD and CHR a lot isn't a good idea either.
> 
> What was it that made it a bad idea?  This is the first time I've heard
> this view expressed by a non-assembly programmer.
> 
I didn't say that a character type is a bad idea, I said that
	_using_      ORD and CHR     _a lot_
isn't a good idea.  What I mean is that I've seen a lot of Pascal
code doing arithmetic on character codes, with ORDs and CHRs
everywhere.  Having to convert a data type to something else before
you can do anything with it is what I don't like.  Having half a data-type can be
worse than not having it at all.  One should either stop pretending (C's
choice) or provide an adequate set of operations (which Common Lisp attempts). 
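To make the contrast concrete, a small sketch in C (my example, assuming
an ASCII execution character set; C itself only guarantees that '0'..'9'
is contiguous, so the letter ranges would be wrong under EBCDIC):

```c
/* Because C characters are just small integers, arithmetic on
 * character codes needs no conversion functions; the Pascal
 * equivalent would be CHR(ORD(c) - ORD('0')) style round trips
 * at every step.
 */
static int digit_value(char c)
{
    return c - '0';                      /* direct arithmetic on the code */
}

static char rot13_letter(char c)
{
    if (c >= 'a' && c <= 'z')            /* assumes contiguous a..z (ASCII) */
        return (char)('a' + (c - 'a' + 13) % 26);
    if (c >= 'A' && c <= 'Z')
        return (char)('A' + (c - 'A' + 13) % 26);
    return c;                            /* non-letters pass through */
}
```

digit_value('7') is 7 and rot13_letter('n') is 'a', with never an
explicit conversion in sight.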

I think the Common Lisp character abstractions aren't quite right either.
Trying to treat control-super-hyper-X as a single character is not quite
right.  (How many versions of Common Lisp have (> char-bits-limit 1)?)

Xerox have a document called something like "The XNS Character Set
Standard" which describes their 16-bit character set.  I have a copy,
but can't find it just now to give you the proper reference.  It has a
lot of sensible stuff to say about characters in general.

For example, it seems obvious that the following should be true:
	forall X, lower-case-p(X) -->
	    exists Y, upper-case-p(Y) & char-upcase(X) = Y

	forall Y, upper-case-p(Y) -->
	    exists X, lower-case-p(X) & char-downcase(Y) = X
This is true in ASCII, and it's true in EBCDIC.  It isn't true, however,
in ASCII's replacement, ISO 8859/1.  Why?  Because of the German
"sz" letter (which is often mistaken for a lower-case beta).  The Common
Lisp designers were canny enough to avoid this:

	alpha-char-p(X)		= X is a letter
	lower-case-p(X)		= X is a lower-case letter
	upper-case-p(X)		= X is an upper-case letter
	both-case-p(X)		= X is lower-case and has an upper-case
				  version, or vice-versa.

So a character in Common Lisp may be a letter without having to belong to
either case (this seems right for Kanji), and may belong to one case without
having an equivalent in the other (which is right for "sz").
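The distinction can be spelled out for ISO 8859/1 in C; the helpers
below are my own sketch (not any standard library), assuming an 8-bit
Latin-1 execution environment:

```c
/* A sketch of Common Lisp's BOTH-CASE-P for ISO 8859/1.  The point:
 * 0xDF ("sz") is a letter with no upper-case partner in Latin-1, so
 * it satisfies lower-case-p but not both-case-p.  (So does 0xFF,
 * y-diaeresis, whose capital is missing from Latin-1.)
 */
static int latin1_lower_p(int c)
{
    return (c >= 'a' && c <= 'z')
        || (c >= 0xDF && c <= 0xFF && c != 0xF7);  /* 0xF7 is the division sign */
}

static int latin1_upper_p(int c)
{
    return (c >= 'A' && c <= 'Z')
        || (c >= 0xC0 && c <= 0xDE && c != 0xD7);  /* 0xD7 is the multiplication sign */
}

static int latin1_both_case_p(int c)
{
    if (latin1_lower_p(c))
        return c != 0xDF && c != 0xFF;   /* sz and y-diaeresis: no upper case */
    if (latin1_upper_p(c))
        return 1;                        /* every Latin-1 capital has a lower case */
    return 0;
}
```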

However, they weren't cautious enough to avoid this:

	forall X, string-char-p(X) ->
	    exists Y, char-upcase(X) = Y &
	        (Y = X v
	         (lower-case-p(X) & upper-case-p(Y) & char-downcase(Y) = X))

	forall Y, string-char-p(Y) ->
	    exists X, char-downcase(Y) = X &
	        (X = Y v
	         (lower-case-p(X) & upper-case-p(Y) & char-upcase(X) = Y))

This is true even of ISO 8859/1, but it is _not_ true of the XNS character
set.  Why?  Because it includes the Greek alphabet, where lower-case sigma
has two forms (one at the end of words, one elsewhere) and upper-case sigma
has only one, and lower-case beta has two forms (one at the beginning of
words, one elsewhere) and capital beta has only one.  Case-shifting _is_ a
bijection on _words_ in that alphabet, but _not_ on single _characters_.
And of course "sz" comes back to haunt us:  up-casing a _word_ containing
an "sz" character doesn't leave that character alone, but replaces it by
"SS".
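A toy illustration of the "sz" problem in C (my own sketch, assuming
ISO 8859/1; the function and its calling convention are invented for
the example, not any standard API):

```c
#include <stddef.h>

/* Up-casing a word is not a character-by-character map: 0xDF ("sz")
 * expands to "SS", so the result can be longer than the input.
 * out must have room for up to 2*len characters; returns the length
 * of the result.
 */
static size_t upcase_word_latin1(const unsigned char *in, size_t len,
                                 unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c == 0xDF) {                 /* sz: one character becomes two */
            out[j++] = 'S';
            out[j++] = 'S';
        } else if (c >= 'a' && c <= 'z') {
            out[j++] = (unsigned char)(c - 'a' + 'A');
        } else if (c >= 0xE0 && c <= 0xFE && c != 0xF7) {
            out[j++] = (unsigned char)(c - 0x20);  /* accented Latin-1 letters */
        } else {
            out[j++] = c;                /* everything else unchanged */
        }
    }
    return j;
}
```

Feeding it the six Latin-1 characters of "strasze" (with 0xDF for the
"sz") yields the seven characters of "STRASSE".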

    If characters are represented by integers, then it is straightforward
to program up missing operations.  If characters are a separate data type,
but that data type is missing many of its "natural" operations, then you
wind up with murky code changing types all over the place.  So the question
is, what are the "natural" operations of the "character" data type (bearing
in mind that changing case doesn't seem to be one of them...)?  When doing
a lot of text processing in Common Lisp, what does one use (char-int C)
and (int-char I) for [other than indexing arrays]?

Quintus Prolog has in library(ctypes)
	is_alnum/1, is_alpha/1, is_ascii/1, is_cntrl/1, is_csym/1,
	is_csymf/1, is_digit/1, is_digit/3, is_endfile/1, is_endline/1,
	is_graph/1, is_lower/1, is_newline/1, is_newpage/1, is_paren/2,
	is_period/1, is_print/1, is_punct/1, is_quote/1, is_space/1,
	is_upper/1, to_lower/2, to_upper/2.
The debt to C should be obvious.   is_digit/3 is similar to Common Lisp's
(digit-char - - -), but was independently derived.
Quintus Prolog has in library(caseconv)
	lower/[1,2], mixed/[1,2], upper/[1,2]
which work on entire text objects, not characters.
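The debt runs both ways: most of these predicates map straight onto
<ctype.h>, and the ones that don't are one-liners over it.  A sketch in
C, assuming is_csym/1 means "character that may appear in a C
identifier" and is_csymf/1 its first character:

```c
#include <ctype.h>

/* Most of library(ctypes) maps directly onto <ctype.h>:
 *   is_alnum/1 -> isalnum()    is_alpha/1 -> isalpha()
 *   is_cntrl/1 -> iscntrl()    is_digit/1 -> isdigit()
 *   is_graph/1 -> isgraph()    is_lower/1 -> islower()
 *   is_print/1 -> isprint()    is_punct/1 -> ispunct()
 *   is_space/1 -> isspace()    is_upper/1 -> isupper()
 *   to_lower/2 -> tolower()    to_upper/2 -> toupper()
 * The identifier-character tests have no <ctype.h> primitive:
 */
static int is_csym(int c)               /* body of a C identifier */
{
    return isalnum(c) || c == '_';
}

static int is_csymf(int c)              /* first character of a C identifier */
{
    return isalpha(c) || c == '_';
}
```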

I have just noticed while writing this message that we should have an
equivalent of Common Lisp's both-case-p.  What other operations are we
missing?  (Apart from dpANS C's language-sensitive collating function
strcoll() and the dreaded setlocale().)

shebs%defun.utah.edu.uucp@utah-cs.UUCP (Stanley T. Shebs) (03/12/88)

In article <756@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:

What he says.  O'Keefe makes some strong points about the complexities
of case conversion in various languages.  I suppose case conversion on
individual characters is too prevalent to drop in favor of "word" or
"sentence" case conversion!  I haven't done any significant text processing
in CL, so can't comment on the "correct" practice.

>I think the Common Lisp character abstractions aren't quite right either.
>Trying to treat control-super-hyper-X as a single character is not quite
>right.  (How many versions of Common Lisp have (> char-bits-limit 1)?)

The real reason for having "char-bits" in CL has more to do with a certain
Lisp company than with sound technical reasons.  Thus the "fancy" characters
are not required to be storable into strings, which limits their usefulness!
Still, most commercial CL impls *do* have (> char-bits-limit 1), but the
main reason seems to be that there are usually about 24 bits available in
the standard representations, but only 7 are actually needed for a code, so
there's nothing to lose by saying that some of the remaining bits are 
"char-bits".  All pretty sad, actually...

>If characters are represented by integers, then it is straightforward
>to program up missing operations.  If characters are a separate data type,
>but that data type is missing many of its "natural" operations, then you
>wind up with murky code changing types all over the place.

A separate data type with conversion functions doesn't imply murky code,
if the missing operations have been written to keep all the murkiness to
themselves.

							stan shebs
							shebs@cs.utah.edu