[comp.lang.eiffel] Eiffel cleanup #5: The character set

bertrand@eiffel.UUCP (Bertrand Meyer) (01/21/90)

A planned change for the character set of Eiffel
------------------------------------------------

This is the fifth of a sequence of postings describing the
changes planned for version 3 of Eiffel.

These are cleanup changes and do not affect anything fundamental.

The particular issue discussed here is the character set for
the language and the support for national variants.
Since the preparation of version 3 is clean-up time,
I have finally looked in some detail at these theoretically mundane
but practically important questions.

I must confess that not much thought went into these aspects when
the language was designed, but now is the time to get them right.
It's great to have the proper solution for the big things,
but why not take care of the little ones as well?

I must further admit that I have no particular expertise in this
field, and won't be offended if it turns out that someone else
has a better answer.

I was considerably helped by the contribution made by Erland
Sommarskog last July (<101@enea.se>). In fact, the advantage I gained
from that posting is almost unfair; Mr. Sommarskog did all the hard
work, and I only had to draw the conclusions for Eiffel.
(Of course, he bears no responsibility for any deficiency in
what follows.) Anyone who wants to contribute comments or criticisms
will be well advised to look at Mr. Sommarskog's message first.

The solution described below only addresses one-byte codes (such as
those used for European languages other than English). No
consideration has been given to multi-byte languages. (If we don't
leave some work for the standardization committee, they might get
bored.)


----------------------------------------------------------------------
|WARNING: The change described here is planned for version 3 of the   |
|environment, not to be released until late 1990.                     |
|                                                                     |
|Any change in the language supported by Interactive's tools          |
|will be accompanied by CONVERSION TOOLS to translate ``old'' syntax  |
|into new. Programmers will NOT need to perform any significant work  |
|to update existing Eiffel software.                                  |
|                                                                     |
|This posting is made solely for the purpose of informing the Eiffel  |
|community about ongoing developments. Although the posting has been  |
|preceded by careful reflection and internal discussions within       |
|Interactive, we make no commitment at this point that the features   |
|described here will actually be included, and, if they are, that     |
|their final form will be the exact one shown below.                  |
----------------------------------------------------------------------


Purpose of the change.
----------------------

Several problems were raised by Mr. Sommarskog with respect to the use
of Eiffel on non-American keyboards/terminals.

1.   - Characters such as

    @ At sign
    [ Opening bracket
    ] Closing bracket
    { Opening brace
    } Closing brace
    | Vertical bar
    \ Backslash
    ^ Circumflex
    ` Back quote
    ~ Tilde

are often pre-empted by national character set variants. For
example, on many French keyboards, [ and ] appear as e'
(e with an acute accent) and e` (e with a grave accent).
Mr. Sommarskog cited further examples with Swedish keyboards.
These character translations make it very unpleasant for programmers
working on such keyboards, who have to remember the correspondence
between the character in the language manual and their local
keyboard equivalents.

Mr. Sommarskog went so far as to say that ``Any programming language
using any of [these] characters as an operator or a delimiter is
committing a crime in my eyes''. (He did not say, however, that the
language *designer* was committing a crime, so I feel relatively
safe even though I will be traveling to Sweden soon.)

The problem does exist in Eiffel since all of the above except `
(back quote) are used in special symbols. (| and ~ will be used
for boolean operators in Eiffel 3.)

2.   - The syntax of identifiers restricts them to letters, underscores
and digits. ``Letters'' here means unaccented letters of the English
alphabet. A French programmer would often like to use accented letters
in an identifier (e.g. e've'nement, with two acute accents), and
similarly for other languages.

3.   - The backslash is particularly ``criminal'' since it has an
important role in strings and character constants as the ``escape''
for special characters, in the Unix-C tradition. For example
a quote in a character constant is \' (backslash-quote).

4.   - As a less important but unpleasant point, special characters
are specified through a three-digit octal code, as in '\756'. Why
force octal? Also, why require exactly three digits, which imply
leading zeroes?


The language change
-------------------

The language change is simple.

First, an observation: brackets and braces are (fortunately) not
strictly needed syntactically in Eiffel: parentheses would do just as well
in the places where these characters are needed. (Brackets are used
for generic parameters; braces for selective exports.)

As a consequence, parentheses now become legal in those places,
although the forms using brackets and braces remain the
standard ones for publication of program texts. Brackets
and braces will continue to be used as the default form for
text produced as output of tools of the Eiffel environment
such as ``short'', even if the original class text uses parentheses.
(Presumably, a decent troff/TEX/Interleaf/Word
adapted to Swedish, Polish or French will still have those characters.)

Similarly, equivalents are defined for ^ (for which the equivalent
is **, as used in Fortran for exponentiation), ~ (not) etc.

Then, the backslash loses its special role as an escape character in
character and string constants. It is replaced by the exclamation mark.
For example, in a character or string constant:

    !!        means        !
    !"        means        "
    !'        means        '
    !T        means        tab
    !N        means        new line
    !D(27)    means        the character of decimal code 27
    !O(27)    means        the character of octal code 27
    !X(27)    means        the character of hexadecimal code 27

etc. The convention for character strings split over two or more lines
remains as before, with ! instead of backslash. In all codes involving
letters (!T, !D etc.), lower- and upper-case are equivalent. For the last
three codes in the above list, note that the numerical value is
parenthesized, so that the number of digits is not fixed.
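
As an illustration only, the escape table above can be sketched as a
small decoder. This is Python rather than Eiffel, and the function name
decode_escapes and its error messages are assumptions made for the
example, not part of any Eiffel tool. The convention for strings split
over several lines is not handled.

```python
# Hypothetical decoder for the proposed "!" escapes inside the body of a
# character or string constant (the surrounding quotes already removed).

SIMPLE = {"!": "!", '"': '"', "'": "'", "T": "\t", "N": "\n"}
BASES = {"D": 10, "O": 8, "X": 16}  # decimal, octal, hexadecimal


def decode_escapes(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        if body[i] != "!":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling '!' at end of constant")
        code = body[i + 1]
        key = code.upper()  # !t and !T are equivalent
        if key in SIMPLE:
            out.append(SIMPLE[key])
            i += 2
        elif key in BASES:
            close = body.find(")", i + 2)
            if body[i + 2 : i + 3] != "(" or close == -1:
                raise ValueError("expected parenthesized code after !" + code)
            # The digits are parenthesized, so their number is not fixed.
            out.append(chr(int(body[i + 3 : close], BASES[key])))
            i = close + 1
        else:
            raise ValueError("unknown escape !" + code)
    return "".join(out)
```

With this sketch, decode_escapes("!O(27)") yields the character of
code 23 and decode_escapes("!X(27)") yields the character of code 39,
since the same digits are read in different bases.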

Finally, although the default alphabet for identifiers is
still the English letters plus digits and underscore, it becomes
possible to use others if they are specified in a special file
(which could be called ``.characters'' in the Unix implementation).
The idea of using a file rather than a compilation option is that if you
deliver classes to a customer (possibly in a different country)
you will deliver the .characters file as well, ensuring consistent
recompilation at the target site; with compilation options this
cannot be achieved.  Furthermore, a file is more flexible.

Obviously, some restrictions are imposed on the characters
that may be specified in the .characters file: they may not conflict
with characters used in special symbols of the language, such as
``;'' or ``:'', unless these symbols have default substitutes
(as with the bracket ``['', whose substitute is ``('').
Just as obviously, once a character has been selected for
identifiers through the .characters file, it cannot be used
as special symbol any more; for example, if you accept the
opening bracket in identifiers because it shows up as e'
on your keyboard, then you may not use it as a bracket any
more and must resort to parentheses.
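
The two restrictions can be sketched as follows, again in Python and
only as an illustration. The format of the .characters file is not
defined here, so the sketch simply takes the extra letters as a string;
the sets HAS_SUBSTITUTE and NO_SUBSTITUTE are deliberately incomplete
assumptions, and all function names are hypothetical. It also assumes
that an identifier must begin with a letter.

```python
import string

ENGLISH_LETTERS = set(string.ascii_letters)
HAS_SUBSTITUTE = set("[]{}^~")  # assumed: symbols with a default substitute
NO_SUBSTITUTE = set(";:")       # assumed: symbols with no substitute


def accept_extra_letters(extra):
    """Reject extra letters that clash with symbols lacking a substitute."""
    extra = set(extra)
    clash = sorted(extra & NO_SUBSTITUTE)
    if clash:
        raise ValueError("no substitute for: " + " ".join(clash))
    return extra


def is_identifier(text, extra=frozenset()):
    """Letters (default or extra), digits and underscores, letter first."""
    letters = ENGLISH_LETTERS | set(extra)
    if not text or text[0] not in letters:
        return False
    return all(c in letters or c in string.digits or c == "_" for c in text)


def still_special(ch, extra=frozenset()):
    """A character chosen for identifiers is no longer a special symbol."""
    return ch in (HAS_SUBSTITUTE | NO_SUBSTITUTE) and ch not in extra
```

For example, after accept_extra_letters("[]") the text [v[nement
(which shows up as e've'nement on the French keyboard described
earlier) is a valid identifier, and [ is no longer available as a
bracket.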


Discussion
----------

The exclamation mark seems to be the least bad among universally
possible choices. Its use as an attention-getter in ordinary language
seems to fit well with its above use as a special character marker.

We have, of course, considered the obvious objection that a
new Eiffel programmer's first attempt may contain the instruction

    putstring ("Hello world!")

which will trigger a compilation error (because the exclamation mark eats
the following double quote). Tough luck. At least, we can
try to produce a decent error message.
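
To see why the constant is left unterminated, here is a minimal sketch
(in Python, with the hypothetical helper name string_constant_end) of
how a scanner might look for the closing double quote when ! protects
the character that follows it.

```python
def string_constant_end(text, start):
    """Index of the quote closing the constant opened at start, or -1."""
    if text[start] != '"':
        raise ValueError("no opening double quote at start")
    i = start + 1
    while i < len(text):
        if text[i] == "!":
            i += 2  # skip the marker and the character it protects
        elif text[i] == '"':
            return i
        else:
            i += 1
    return -1  # ran off the end: unterminated constant


line = 'putstring ("Hello world!")'
unterminated = string_constant_end(line, line.index('"')) == -1
```

Here unterminated is true, because the final !" is read as an escaped
double quote. Writing the text as "Hello world!!" closes the constant
as intended.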
-- 
-- Bertrand Meyer
bertrand@eiffel.com

weiner@novavax.UUCP (Bob Weiner) (01/23/90)

In article <236@eiffel.UUCP> bertrand@eiffel.UUCP (Bertrand Meyer) writes:

   Then, the backslash loses its special role as an escape character in
   character and string constants. It is replaced by the exclamation mark.

If this change were optional for international programmers it would be
fine but if it is required of all programmers it is unacceptable.  This
is the equivalent of all of the personal computer vendors that create
their own regular expression syntax.  They add no new functionality but
change the character set so that the largest body of programmers who use
regular expressions (UNIX programmers) frequently must relearn something
they already understand well.  The backslash as a character quote has
stood the test of time for usage, readability, etc. on UNIX systems.  If
UNIX International has not seen fit to eliminate its usage, there is no
reason that such should be done in Eiffel.

Just add some mechanism in the character mapping file that lets
programmers use exclamation marks instead of backslashes if they want.

-- 
Bob Weiner, Motorola, Inc.,   USENET:  ...!gatech!uflorida!novavax!weiner
(407) 364-2087

shelley@atc.sps.mot.com (Norman K. Shelley) (01/24/90)

I heartily agree with Weiner's comments on changing the backslash usage to
an exclamation point thereby changing a style that has been worked around
in the international UNIX world already.

In fact I would appreciate the ability to use the
exclamation point character as "not" and "/" (for "!=" instead of having
to use "/="). A personal mapping file, preprocessor, or whatever to allow my
taste (and many Unix/C programmers' taste) would be extremely nice.


Norman Shelley
Motorola - ATC
2200 W. Broadway M350
Mesa, AZ 85202
...!uunet!dover!atc!shelley
shelley@atc.sps.mot.com
(602) 962-2473

jacob@gore.com (Jacob Gore) (01/25/90)

/ comp.lang.eiffel / shelley@atc.sps.mot.com (Norman K. Shelley) / Jan 24 1990/
> In fact I would appreciate the ability to use the
> exclamation point character as "not" and "/" (for "!=" instead of having
> to use "/="). A personal mapping file, preprocessor, or whatever to allow my
> taste (and many Unix/C programmers' taste) would be extremely nice.

This is going too far.  It's one thing to accommodate hardware restrictions,
but quite another to provide character mappings for the purpose of personal
style.  Are you going to ship your personal mapping file with each file of
source code?  And what if there are several people working on one project
-- how are you going to associate their mappings with various files?
``#include "normans_key_map.h"''?

If "/=" bothers you so much, you can always run your programs through
something like "sed -e 's:!=:/=:g'" before letting the compiler (AND other
people) see it.

Jacob
--
Jacob Gore		Jacob@Gore.Com			boulder!gore!jacob