sommar@enea.UUCP (09/06/87)
This is my idea of how to change the character concept. Before we start I'd like to point out what I want to change:

* The character type. In most languages it is regarded as just another enumeration type. I want to make it an abstract data type.
* Comparison operators like <, <=, >, >=, = and /= for characters and strings. Their result should follow the currently chosen language, not the numeric coding.
* Communication between computers and terminals/printers. The coding scheme I propose matters only at this level.
* Hardware. It must be able to handle the new coding scheme.

The proposal covers only alphabets using Latin letters; I know too little to discuss Cyrillic or Japanese. A serious drawback of ISO 8859/1-4 is that it is impossible to mix letters from the various sets. My solution has the same drawback if you want to mix Latin with Cyrillic, but that is much less likely.

So, first to the character type and the comparison operators. What we need here is run-time library support. You choose a language, and you should probably be able to do this on several levels. At the OS level you could have a SET LANGUAGE command making all system programmes behave according to the rules of that language. (I'm talking about character comparisons here. Other things like date formats could be included, but that's irrelevant here.) In your own programmes you could decide with a pragma at compile time and with standard routine calls at run time. The default should be the choice at the OS level. Note that one of the available languages could be ASCII.

The really important point is that the underlying representation should be hidden. It may be the same as on the communication level, but the OS may prefer to convert to an internal format when reading the terminal line.

So to the communication level. Actually we could choose any representation we like, since our programmes no longer depend on it.
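The "hidden representation" idea can be sketched as follows: comparisons go through a language object chosen at run time, never through the raw numeric codes. This is only an illustration; the collation tables below are rough assumptions, not part of the proposal.

```python
class Language:
    """Comparisons follow the chosen language's collation,
    not the numeric coding of the characters."""

    def __init__(self, name, order):
        self.name = name
        # Rank table: position in `order` defines the collation order.
        self.rank = {ch: i for i, ch in enumerate(order)}

    def key(self, s):
        return [self.rank[c] for c in s]

    def less(self, s, t):
        """String comparison under this language's collation."""
        return self.key(s) < self.key(t)

# In Swedish, å ä ö sort after z; one German convention sorts umlauted
# vowels next to their base letters. Both tables are sketches.
swedish = Language("swedish", "abcdefghijklmnopqrstuvwxyzåäö")
german  = Language("german",  "aåäbcdefghijklmnoöpqrstuvwxyz")

# Same strings, different answers depending on the chosen language:
print(swedish.less("zebra", "åska"))  # True:  å comes after z
print(german.less("zebra", "åska"))   # False: å sorts near a
```

A SET LANGUAGE command would then simply pick which such object the run-time library hands out by default.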
An easy solution would be 16-bit codes, but that would waste channel capacity, and we probably don't need that many. Of course, for communication we could use 10-, 11- or 12-bit codes if we liked. My idea is a bit different, though.

I have two sets of 128 characters: an ordinary set and a modifier set. All characters are sent as eight bits. The most significant bit is 0 if the next character is from the ordinary set and 1 if it is a modifier. For example, "a" with a grave accent is sent as the code for "a" + 128 followed by the code for the grave accent. Note that this enables handling multiple modifiers on one character: if we want a cedilla too, we set the upper bit on the code for the grave accent as well.

The modifiers are of two kinds: concrete and abstract. The first group contains accents, diaeresis and such. The abstract ones are upper case, ligature, superscript and subscript. Having case as a modifier saves many slots in the ordinary set. The ligature modifier means that the following character is to be combined with the preceding one.

The ordinary set consists of 28 lower-case letters and the digits. 28? Yes: A to Z minus W, plus Icelandic Thorn and Edh and German double-s. W is sent as V-ligature-V. Perhaps we must also make room for dotless "i". (The ligature modifier will probably be hard to implement in hardware. It's no disaster if we have to give it up; we would just have to add four more characters: AE, IJ, OE and VV.)

I assume that we want 32 slots for non-printing control characters in both sets. With about 20 modifiers, 30 letters and 10 digits, this leaves about 130 slots free for other symbols like !"#$%. Actually there is room for about 60 more, since we could use the upper-case modifier on non-letter characters too. 190 symbols may sound like a lot, but I'm quite sure we could fill them. ASCII misses many, and still has some that are superfluous outside the USA (@, for example).
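The wire format above can be sketched in a few lines: each byte's high bit says whether another modifier byte follows for the same character. The concrete modifier code numbers (GRAVE, CEDILLA) are made-up placeholders; the proposal does not assign values.

```python
MORE = 0x80                  # high bit: "a modifier follows"
GRAVE, CEDILLA = 0x01, 0x02  # hypothetical modifier codes

def encode(base, modifiers=()):
    """Encode one character plus its modifiers as a byte sequence."""
    codes = [base, *modifiers]
    # Every byte except the last carries the continuation bit.
    return bytes(c | MORE for c in codes[:-1]) + bytes([codes[-1]])

def decode(data):
    """Decode a byte string into (base, [modifiers]) pairs."""
    out, it = [], iter(data)
    for b in it:
        mods = []
        pending = b & MORE
        base = b & 0x7F
        while pending:           # keep reading while the high bit is set
            m = next(it)
            pending = m & MORE
            mods.append(m & 0x7F)
        out.append((base, mods))
    return out

# "a" with grave accent: code for 'a' + 128, then the accent code.
wire = encode(ord('a'), [GRAVE])
print([hex(b) for b in wire])    # ['0xe1', '0x1']

# Grave *and* cedilla: the grave-accent byte also gets its high bit set.
print(decode(encode(ord('a'), [GRAVE, CEDILLA])))
```

Note how an unmodified character costs exactly one byte, so plain text pays nothing for the scheme.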
Note that a modifier could easily be sent as a separate character: send it as a modified space. The standard ought to define a control character that means "change standard"; the following byte would indicate whether the coming characters are from ASCII, the Cyrillic or Japanese standards, or whatever.

And so to hardware. A simple low-budget solution would just superimpose the modifiers on the characters. Probably the more common cases would get their own matrices, though; specifically, this is true for the ligatures. You wouldn't expect every terminal to support Cyrillic and such, of course, and perhaps not all 190 symbols should be mandatory. Note that the modifier concept is unknown at the keyboard: you press W and the terminal sends V-ligature-V. For rarer characters a "compose character" key could be used.

As you see, this is a quite radical standard. It would probably take something like ten years to become commonly supported, and that is if we settled it today. Still, I think we have to do something like this: computers are cripples at character handling today. And don't tell me we can't do this because a bunch of C programmes rely on a character being a byte. We have the computers and the programmers to serve humanity, not the opposite.
-- 
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
karl@haddock.ISC.COM (Karl Heuer) (09/09/87)
In article <2254@enea.UUCP> sommar@enea.UUCP (Erland Sommarskog) writes:
>The ordinary set consists of 28 lower-case letters and the digits. 28?
>Yes: A to Z minus W, plus Icelandic Thorn and Edh and German double-s. W is
>sent as V-ligature-V. Perhaps we must also make room for dotless "i".

Of course, dotless-i could *replace* dotted-i; then dotted-i would become an accented character. We could also save a slot by using i-cedilla for j. Then the Dutch ligature ij would be sent as I-dot-ligature-I-dot-cedilla!

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
jc@minya.UUCP (John Chambers) (09/15/87)
In article <1072@haddock.ISC.COM>, karl@haddock.ISC.COM (Karl Heuer) writes:
> Of course, dotless-i could *replace* dotted-i; then dotted-i would become an
> accented character. We could also save a slot by using i-cedilla for j. Then
> the Dutch ligature ij would be sent as I-dot-ligature-I-dot-cedilla!

No, Karl, it'd be more efficient to use i-ligature-j-cedilla-umlaut (where *both* 'i' and 'j' are dotless). (I knew there had to be a language that combines umlauts and cedillas! :-)
-- 
John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)