[comp.std.internat] A proposal for a new character standard

sommar@enea.UUCP (09/06/87)

This is my idea of how to change the character concept. Before
we start I'd like to point out what I want to change:

* The character type. In most languages it is regarded as just
  another enumeration type. I want to make it an abstract
  data type.
* Comparison operators like <, <=, >, >=, = and /= for characters
  and strings. Their results should follow the currently
  chosen language, not the numeric coding.
* Communication between computers and terminals/printers. The coding
  scheme I propose is only of importance on this level.
* Hardware. It must be able to handle the new coding scheme.

The proposal covers only alphabets using Latin letters. I know
too little to discuss Cyrillic or Japanese. A serious drawback
with ISO 8859/1-4 is that it's impossible to mix letters from
the various sets. My solution has the same drawback if you want
to mix Latin with Cyrillic, but this is much less likely.


So, first to the character type and the comparison operators.
What we need here is run-time library support. You choose
a language; probably you should be able to do this on several
levels. At the OS level you could have a SET LANGUAGE command
making all system programmes behave according to the rules of
that language. (I'm talking about character comparisons here. Other
things like date formats could be included, but that's 
irrelevant here.) In your own programmes you could decide with a
pragma at compile time and standard routine calls at run-time.
The default should be the choice at the OS level. Note that
one of the available languages could be ASCII.
  The really important point here is that the underlying 
representation should be hidden. It may be the same as on
the communication level, but the OS may prefer to convert
to an internal format when reading the terminal line.
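  To make this concrete, here is a minimal sketch in C of what the
run-time support could look like. The names (set_language(),
char_cmp(), str_cmp()) and the one-weight-per-code table are invented
for the example; the point is only that comparison goes through a
table chosen by the language, never through the raw codes:

struct language {
    const char     *name;
    unsigned short  sort_key[256];   /* collation weight for each code */
};

static const struct language *current_lang;

void set_language(const struct language *lang)
{
    current_lang = lang;
}

/* Compare two characters according to the currently chosen language
 * rather than the numeric coding; <0, 0 or >0 like strcmp(). */
int char_cmp(unsigned char a, unsigned char b)
{
    return (int)current_lang->sort_key[a] - (int)current_lang->sort_key[b];
}

int str_cmp(const unsigned char *s, const unsigned char *t)
{
    while (*s != '\0' && char_cmp(*s, *t) == 0) {
        s++;
        t++;
    }
    return char_cmp(*s, *t);
}

One weight per character is of course a simplification (real collation
needs more), but it shows why the underlying representation can stay
hidden: programmes only ever call the routines.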


So to the communication level. Actually we could choose any
representation we like, since our programmes no longer depend
on it. An easy solution would use 16-bit codes, but that would
be a waste of channel capacity, and we probably don't need that
many. Of course, for communication we could use 10-, 11- or
12-bit codes if we liked. My idea is a bit different, though.
  I have two sets of 128 characters, an ordinary set and a 
modifier set. All characters are sent as eight bits. The
most significant bit of a byte is 0 if the next byte is from the
ordinary set and 1 if it is a modifier applying to this character.
For example "a" with grave accent is sent as the code for "a" + 128
followed by the code for the grave accent. Note that this enables
handling of multiple modifiers on one character. If we want a
cedilla too, we set the upper bit on the code for the grave accent as
well.
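  As an illustration, a receiver could unpack the format like this.
The names (read_char(), struct decoded_char) and the limit of eight
modifiers are mine, not part of the proposal:

#include <stdio.h>

#define MAX_MODIFIERS 8

struct decoded_char {
    unsigned char base;                     /* code from the ordinary set  */
    unsigned char modifier[MAX_MODIFIERS];  /* codes from the modifier set */
    int           nmod;
};

/* Read one (possibly modified) character; return 0 at end of input.
 * Bit 7 set on a byte means "a modifier byte follows". */
int read_char(FILE *fp, struct decoded_char *dc)
{
    int c = getc(fp);

    if (c == EOF)
        return 0;

    dc->base = c & 0x7F;
    dc->nmod = 0;

    while ((c & 0x80) && dc->nmod < MAX_MODIFIERS) {
        c = getc(fp);
        if (c == EOF)
            break;
        dc->modifier[dc->nmod++] = c & 0x7F;
    }
    return 1;
}

Read this way, "a" with both grave accent and cedilla arrives as three
bytes: the "a" code with bit 7 set, the grave code with bit 7 set, and
the cedilla code with bit 7 clear.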
  The modifiers are of two kinds: concrete and abstract. The
first group contains accents, diaeresis and such. The abstract
ones are upper case, ligature, superscript and subscript. Having
case as a modifier saves many slots in the ordinary set. The 
ligature modifier means that the following character is to be
combined with the former. 
  The ordinary set consists of 28 lower case letters and the
numbers. 28? Yes, A to Z minus W plus Icelandic Thorn and Edh
and German double-s. W is sent as V-ligature-V. Perhaps we must
also give place for dotless "i".
  (The ligature modifier will probably be hard to implement in
hardware. It's no disaster if we have to give it up. We would have
to add four more characters: AE, IJ, OE and VV.)
  I assume that we want 32 slots for non-printing control characters
in both sets. Two sets of 128 give 256 slots; subtracting 64 control
codes, about 20 modifiers, 30 letters and 10 figures leaves roughly
130 slots free for other symbols like !"#$%. Actually there is room
for about 60 more, since we could use the upper-case modifier on
non-letter characters as well. 190 symbols may sound like a lot, but
I'm quite sure we could fill them. ASCII misses many, and still has
some that are superfluous outside the USA. (@ for example.)
  Note that a modifier could easily be sent as a separate character:
just send it as a modified space.
  The standard ought to define a control character that means "change
standard". The following byte would indicate if the coming characters
were from ASCII, the Cyrillic, the Japanese standard or whatever.
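  Purely as an illustration (the control code value and the standard
numbers are made up; the proposal only says that such a code should
exist), the decoder above could handle it like this:

#include <stdio.h>

#define CTRL_CHANGE_STD 0x1B    /* made-up value for the control code */

enum std_id { STD_PROPOSED, STD_ASCII, STD_CYRILLIC, STD_JAPANESE };

static enum std_id current_std = STD_PROPOSED;

/* Call on each incoming byte before normal decoding; returns 1 if the
 * byte started a "change standard" sequence and was consumed together
 * with its argument byte. */
int handle_change_std(FILE *fp, int c)
{
    int id;

    if (c != CTRL_CHANGE_STD)
        return 0;
    id = getc(fp);
    if (id != EOF)
        current_std = (enum std_id)id;
    return 1;
}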

And so to hardware. A simple low-budget solution would just be
superimposing the modifiers on the characters. Probably the more usual
cases would get their own matrices, though. Specifically this is true 
for the ligatures. 
  You wouldn't expect any terminal to support Cyrillic and such, of
course. Perhaps not all 190 symbols should be mandatory.
  Note that the modifier concept is unknown at the keyboard. You press
W and the terminal sends V-ligature-V. For rarer characters
a "compose character" key could be used.


As you see, this is quite a radical standard. Probably it would
require something like 10 years to become commonly supported, and that
is if we settled it today. Still, I think we have to do something like
this. Computers are cripples at character handling today. And don't
tell me we can't do this because a bunch of C programmes rely
on a character being a byte. We have the computers and the
programmers to serve humanity, not the opposite.




-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

karl@haddock.ISC.COM (Karl Heuer) (09/09/87)

In article <2254@enea.UUCP> sommar@enea.UUCP (Erland Sommarskog) writes:
>The ordinary set consists of 28 lower case letters and the numbers.  28?
>Yes, A to Z minus W plus Icelandic Thorn and Edh and German double-s.  W is
>sent as V-ligature-V.  Perhaps we must also give place for dotless "i".

Of course, dotless-i could *replace* dotted-i; then dotted-i would become an
accented character.  We could also save a slot by using i-cedilla for j.  Then
the Dutch ligature ij would be sent as I-dot-ligature-I-dot-cedilla!

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

jc@minya.UUCP (John Chambers) (09/15/87)

In article <1072@haddock.ISC.COM>, karl@haddock.ISC.COM (Karl Heuer) writes:
> 
> Of course, dotless-i could *replace* dotted-i; then dotted-i would become an
> accented character.  We could also save a slot by using i-cedilla for j.  Then
> the Dutch ligature ij would be sent as I-dot-ligature-I-dot-cedilla!

No, Karl, it'd be more efficient to use i-ligature-j-cedilla-umlaut (where 
*both* 'i' and 'j' are dotless).

(I knew there had to be a language that combines umlauts and cedillas! :-)

-- 
	John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)