[comp.lang.ada] characters with codes >= 128

sommar@enea.UUCP (08/31/87)

I'd like to write a programme that can handle text which contains
characters from an extended ASCII set, to cover national 
characters. The LRM seems to totally disregard this, since it
states that the character type is ASCII with 128 possible values.
Also, Ada only allows you to have printable characters within
strings. And printable is defined as the range ' '..'~'. 
  Easy, you might say. Just define a new character type. How?
I can't have quoted strings for the new characters, since they
are "non-printing". I can't extend the ASCII package (in STANDARD),
since it relies on the character type already being defined.
  And even if I succeed somehow, what about Text_io? Will the 
compiler accept an attempt to give Text_io the new character type,
even if it's called "character"? Hardly.
  Have I missed something? I hope so. If not, THIS IS A VERY SERIOUS
RESTRICTION IN ADA.

I should add that to some extent it is possible to handle these
characters. My Ada system (Verdix 5.2A for VAX/Unix) doesn't mind
if I read an extended character from a file or if I try to write
it. Character'val(ch), where ch is the character's code, returns
the correct character. But Character'pos(Character'val(ch)) raises
Constraint_error if ch is from the upper half.
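  A little test in that spirit (nothing fancier than a character copy;
whether your compiler lets the upper half through is of course the
whole question):

  with Text_io;
  procedure Copy_chars is
     Ch : Character;
  begin
     -- Copy standard input to standard output, character by character
     -- (line structure is not preserved).  On my system characters
     -- with codes >= 128 pass through unharmed.
     while not Text_io.End_of_file loop
        Text_io.Get(Ch);
        Text_io.Put(Ch);
     end loop;
  end Copy_chars;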
  But this helps only a little. I want string constants in my programme,
and there this approach is dead. What to do? Read them from a file at
start-up? :-)
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

mooremj@EGLIN-VAX.ARPA ("MARTIN J. MOORE") (08/31/87)

I encountered the same problem in attempting to use DEC's extended character
set.  I worked around it by using an UNCHECKED_CONVERSION to stick 8-bit
values into CHARACTER objects (thereby making the program erroneous according
to the LRM; however, it worked.) 

For example, to use the DEC control character CSI (= 155) I did:

  function EIGHT_BIT_CHARACTER is new UNCHECKED_CONVERSION (INTEGER, CHARACTER);

  CSI : constant CHARACTER := EIGHT_BIT_CHARACTER (155);

Characters so defined could then be used in string constants, such as
the following:

  ERASE_SCREEN : constant STRING := CSI & "2J";	-- ANSI erase screen command
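
For completeness, the whole thing gathered into one small unit (the
procedure name is arbitrary; it still depends on the implementation
behaving as described above and on a display that understands ANSI
escape sequences):

  with TEXT_IO;
  with UNCHECKED_CONVERSION;
  procedure CLEAR_SCREEN is
     function EIGHT_BIT_CHARACTER is
        new UNCHECKED_CONVERSION (INTEGER, CHARACTER);
     CSI          : constant CHARACTER := EIGHT_BIT_CHARACTER (155);
     ERASE_SCREEN : constant STRING    := CSI & "2J";  -- ANSI erase screen command
  begin
     TEXT_IO.PUT (ERASE_SCREEN);
  end CLEAR_SCREEN;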

------------------------------------------------------------------------------
Martin Moore
mooremj@eglin-vax.arpa
------

stt@ada-uts (09/02/87)

You may define your own enumeration type, but you are correct
that only "standard" ASCII graphic characters may be used
in character and string literals.  For "characters" which
have no standard ASCII graphic representation, you should
define normal non-character enumeration literals.

You may then define your own IO package (there is generally nothing
"magical" about TEXT_IO, except that the compiler-writer
has to provide it) to provide I/O for characters and arrays
of these extended characters.

Extended "string" literals can be created via concatenation
of strings containing only graphic characters and enumeration
literals for extended characters.

S. Tucker Taft
Intermetrics, Inc.
Cambridge, MA  02138

P.S.:  Here is an example:

package Extended_ASCII is
    type X_Character is (NUL, SOH, . . ., 'a', 'b', . . .,
      UMLAUT, ALPHA, OMEGA, . . .);
    type X_String is array(Positive range <>) of X_Character;
    pragma Pack(X_String);
end Extended_ASCII;

with Extended_ASCII; use Extended_ASCII;
package X_Text_IO is
    . . .
    procedure Put_Line(Str : X_String);
    . . .
end X_Text_IO;

with Extended_ASCII; use Extended_ASCII;
with X_Text_IO;
procedure Test is
    S : constant X_String :=
      "This is an " & ALPHA & " to " & OMEGA & " test.";
begin
    X_Text_IO.Put_Line(S);
end Test;
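
One way the body of Put_Line could be filled in -- a sketch only, and
it leans on the same sort of implementation-dependent 8-bit conversion
Martin Moore described (the names Octet and To_Char are made up):

with Text_IO;
with Unchecked_Conversion;
package body X_Text_IO is

    -- Assumes the literals of X_Character are declared in the order of
    -- the codes to be sent, that Character occupies eight bits on the
    -- host, and that no range check gets in the way.
    type Octet is range 0 .. 255;

    function To_Char is new Unchecked_Conversion(Octet, Character);

    procedure Put_Line(Str : X_String) is
    begin
        for I in Str'Range loop
            Text_IO.Put(To_Char(Octet(X_Character'Pos(Str(I)))));
        end loop;
        Text_IO.New_Line;
    end Put_Line;

    . . .  -- bodies for the other operations of the spec

end X_Text_IO;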

sommar@enea.UUCP (09/05/87)

In a recent article colbert@hermix.UUCP writes:
 >You can create your own Character type by defining an enumeration type that
 >has character literals.
 >	type Character_Type is (Nul, Del, ..., 'A', 'B', ...,
 >				Koo_Kai, Khoo_Khai, ....);
 >...
 >Once you have this character type defined, you can create a string type by
 >defining an array of this character type:
 >
 >	type String_Type is array (positive range <>) of Character_Type;
 >...
 >However, you will have to use catenation to create string_type expressions that
 >contain your country's special characters (and of course non-printable
 >characters).
 >...
 >As for the I/O of your language specific characters, you will need to create
 >a Thai_Text_IO (or something equivalent).  Ada does not say that Text_IO is
 >the ONLY text I/O package, only that it is the standard text I/O package.  In
 >this case you need something non-standard.

I think Martin Moore's solution was much simpler and more elegant. It will
work on any Ada system that doesn't check character assignments for 
Constraint_error.
  This solution requires a hell of a lot of work and it isn't portable from
one OS to another. Yes, I can write my own Text_IO, but guess how much fun
I find that. And I will have to write one Text_IO for each OS I want to work
with. Guess why there is a standard Text_IO. It gives you a standard 
interface.
  But even better, a change in the language definition would be 
appropriate. It's ridiculous that perfectly good letters are being 
regarded as illegal and unprintable.

-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

jmoody@DCA-EMS.ARPA (Jim Moody, DCA C342) (09/09/87)

It's not clear that there's a conflict between Martin Moore's solution
to the problem and that of colbert@hermix.UUCP.  Colbert is clearly
correct that formally one should create a new version of text_io.
Martin tells you how to do that on certain targets.  There is little
thought required to turn Martin's solution into a full-blown text_io
package (about three minutes, don't invite A. E. Housman's scorn), and
not more than a couple of hours' typing.  The point is that this makes
all applications which use (say) Thai_text_io portable in the sense that
it isolates the machine dependencies (does Martin's trick work?) into a
single package.  Which, I thought, was the point.

Jim Moody
DCA/JDSSC

colbert@hermix.UUCP (colbert) (09/10/87)

In response to my answer to his question about character types
sommar@seismo.css.gov  (Erland Sommarskog) writes:

>I think Martin Moore's solution was much simpler and more elegant. It will
>work on any Ada system that doesn't check character assignments for 
>Constraint_error.
>  This solution requires a hell of a lot of work and it isn't portable from
>one OS to another. Yes, I can write my own Text_IO, but guess how much fun
>I find that. And I will have to write one Text_IO for each OS I want to work
>with. Guess why there is a standard Text_IO. It gives you a standard 
>interface.

Unfortunately, Martin Moore's solution is NOT portable either.  It only works
because:

	1) Unchecked_Conversion is implemented in DEC Ada.

	2) The size of type Character objects in DEC is 8 bits.

	3) DEC did not give a Constraint_Error on the assignment (which may
	   be a bug in DEC's implementation).

	4) DEC does not "place restrictions on unchecked conversions"
	   (13.10.2 P2);

	5) DEC truncates the high order bits of the source value if its size is
	   greater than the size of the target type (this is really only a
	   problem with the specific example given by Moore, in that he used
	   the type Integer as the source type as opposed to an 8-bit type).

The principal benefit of my proposed solution is the creation of a portable
abstraction that represents the problem.  Re-implementing a Text I/O for this
type is a small price to pay for this benefit (especially when Moore's
technique can be used in the implementation of this Text I/O - sufficiently
isolated to prevent major impact on the system that I'm implementing and later
porting [as pointed out by another reader of this group]).


Take Care,
Ed Colbert
hermix!colbert@rand-unix.arpa

P.S.

As an additional comment, at the recent SIGAda Conference, Dr. Dewar indicated
that Unchecked_Conversion could be legally implemented to always return 0 no
matter what the "value" of the source object was.  I did not get a chance to
fully nail him down on what he meant by this comment, so maybe he will respond
to this message.

mooremj@EGLIN-VAX.ARPA ("MARTIN J. MOORE") (09/10/87)

> From: colbert <hermix!colbert@rand-unix.ARPA>

> Unfortunately, Martin Moore's solution is NOT portable either.  It only works
> because:
>                      [list of reasons]

He is absolutely correct.  My solution is non-portable and, as I pointed 
out in my original message, erroneous as defined by the LRM.  My purpose in
posting it was to possibly help the original questioner, since the solution
does work on the VAX and may work on other machines.  It wasn't intended to be 
a universal solution.  The approach suggested by Colbert et al is obviously 
the way to go to provide portability.  

				Martin Moore
------

barmar@think.COM (Barry Margolin) (09/10/87)

In article <8709100440.AA04224@rand-unix.rand.org> hermix!colbert@rand-unix.ARPA writes:
>As an additional comment, at the recent SIGAda Conference, Dr. Dewar indicated
>that Unchecked_Conversion could be legally implemented to always return 0 no
>matter what the "value" of the source object was.  I did not get a chance to
>fully nail him down on what he meant by this comment, so maybe he will respond
>to this message.

I presume that this refers to the fact that the language doesn't
specify what the result of Unchecked_Conversion is, but rather leaves
it implementation-defined.  In that case, an implementation may return
any value, and returning 0 in all cases would be valid.  It's not a
very useful behavior, but it isn't the purpose of a language spec to
define a useful language, merely a portable one.  Since you must check
the implementation spec to find out what the result is, you'll know
whether it is useful.

---
Barry Margolin
Thinking Machines Corp.

barmar@think.com
seismo!think!barmar

sommar@enea.UUCP (Erland Sommarskog) (09/12/87)

hermix!colbert@rand-unix.ARPA writes:
>In response to my answer to his question about character types
>sommar@seismo.css.gov  (Erland Sommarskog) writes:
>>I think Martin Moore's solution was much simpler and more elegant. It will
>>work on any Ada system that doesn't check character assignments for 
>>Constraint_error.
>Unfortunately, Martin Moore's solution is NOT portable either.  It only works
>because:

Well, I didn't say it was portable, did I? I would say it is "presumably"
portable.

>	1) Unchecked_Conversion is implemented in DEC Ada.
As far as I understand, the LRM doesn't mention Unchecked_conversion as optional.
It's true, however, that it allows for arbitrary restrictions.

>	2) The size of type Character objects in DEC is 8 bits.
A quite reasonable assumption with the architectures of today. And
the only time you're in trouble is when the size is exactly seven
bits.

>	3) DEC did not give a Constraint_Error on the assignment (which may
>	   be a bug in DEC's implementation).
Verdix Ada on VAX/Unix doesn't seem to bother, either. It's probably a
violation of the language definition, but somehow it seems like
a frequent violation. (Check your Ada system. Have it read a non-ASCII
character and fiddle around with it. Do you get Constraint_error?)

>	4) DEC does not "place restrictions on unchecked conversions"
>	   (13.10.2 P2);
True, as I said under 1), but on the other hand, what reasons for
restrictions are there in this case?

>	5) DEC truncates the high order bits of the source value if its size is
>	   greater than the size of the target type (this is really only a
>	   problem with the specific example given by Moore, in that he used
>	   the type Integer as the source type as opposed to an 8-bit type).
Yes, replace Integer with Very_short_integer and it's fixed.
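Something like this, say (the names are made up, and it still depends
on the implementation playing along, e.g. accepting the size clause):

  with Unchecked_conversion;
  package Eight_bit_chars is
     type Eight_bits is range 0 .. 255;
     for Eight_bits'size use 8;    -- match the 8 bits of Character

     function Eight_bit_character is
        new Unchecked_conversion(Eight_bits, Character);

     CSI : constant Character := Eight_bit_character(155);
  end Eight_bit_chars;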

>The principal benefit of my proposed solution is the creation of a portable
>abstraction that represents the problem.  Re-implementing a Text I/O for this
>type is a small price to pay for this benefit (especially when Moore's
>technique can be used in the implementation of this Text I/O - sufficiently
>isolated to prevent major impact on the system that I'm implementing and later
>porting [as pointed out by another reader of this group]).

Mr. Colbert seems to share my opinion about "presumably" portable. Otherwise
he wouldn't propose using Moore's technique in Text_io_8_bit. Now we don't
have to rewrite Text_io_8_bit until we meet an Ada system that does not
implement things the way we need them.
  It's perfectly true that if we stick to Moore's original idea, we have
a lot more code to rewrite. If I were to write a 10000-line system I
would surely consider Text_io_8_bit. Presently, I'm not. I just want to
write some small pieces of code to demonstrate more meaningful character
comparisons than the use of a simple collating sequence. (Those who read
comp.std.internat know what I'm talking about.)
  And, guys, can't we agree that it would have been much easier if
the language definition in one way or another had made room for a wider 
character concept than 128 ASCII codes?



-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

jmoody@DCA-EMS.ARPA (Jim Moody, DCA C342) (09/14/87)

Erland Sommarskog (sommar@enea.uucp) gets to the heart of the
disagreement when he writes:
    And, guys, can't we agree that it would have been much easier
    if the language definition in one way or another had made room
    for a wider character concept than 128 ASCII codes?
No it wouldn't.  Or at least, easier for whom?

Text_IO, remember, is standard.  That means that all vendors must support
it.  And must support it to all output devices (not just bright terminals).
That means printers with hammers which are limited to the ASCII 95 graphic
characters.  The only reasonable way of requiring vendors to support
something more than the 95 characters plus ASCII.HT would be to make
Text_IO generic.  This brings its own problems:  there is currently no
provision for a generic formal parameter to be restricted to character
types, and indeed no requirement that a compiler recognise character types
as a separate semantic category.  I do not know that LRM 3.5.2 is
referenced elsewhere in the LRM.  This means that doing what Sommarskog
wants imposes costs on a vendor/implementor which are not limited to
the Text_IO package but spread into the middle part of the compiler.

If we have a cost of such magnitude, we are entitled to ask what benefit
it produces for the user community as a whole.  I think that it was a
reasonable decision to limit the standard to the 95 ASCII printables plus
ASCII.HT, which means that if someone wants to use other characters, he or
she has to shoulder the cost rather than have the entire user community
pay.  I emphasise that this is a cost/benefit decision which could change
in the future.  One of these days, Ada standardisation will be reopened.
If at that point it is clear that a substantial segment of the user
community is using or wants to use a bigger character set, the benefits
of centralising the cost of supporting them may outweigh the costs.  I
doubt that they do now.  That is, the cost to Sommarskog of implementing
the subset of text_io which he needs, plus the cost to the other users
of implementing the subsets of text_io that they need for the character
sets they want to use, is less than the cost (for 137 compilers at last
count) of requiring vendors to support bigger character sets.  Maybe I'm
wrong.  Maybe there are a thousand applications out there which need
bigger character sets (I think that's the order of magnitude needed for
it to be cheaper on the whole for vendors to support them).  If there are,
then ISO/ANSI/AJPO probably need to be told.
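
To make the generic point concrete: the closest thing available today
is a formal discrete type, and that lets in far more than the character
types (the package below is purely illustrative):

    generic
       type Char is (<>);        -- accepts INTEGER, BOOLEAN, ... as well
    package Generic_Char_IO is
       type Line is array (Positive range <>) of Char;
       procedure Put_Line (Item : Line);
       -- and note that string literals are only available for Line
       -- when Char happens to be a character type
    end Generic_Char_IO;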

Usual disclaimer:  the opinions expressed are my own and should not be
construed as the opinions of the US Government.

Sorry to go on at such length.

Jim Moody

sommar@enea.UUCP (Erland Sommarskog) (09/15/87)

jmoody@DCA-EMS.ARPA (Jim Moody, DCA C342) writes:
>Erland Sommarskog (sommar@enea.uucp) gets to the heart of the
>disagreement when he writes:
>    And, guys, can't we agree that it would have been much easier
>    if the language definition in one way or another had made room
>    for a wider character concept than 128 ASCII codes?
>No it wouldn't.  Or at least, easier for whom?

As you have guessed, I'm not giving in that easily. 

>Text_IO, remember, is standard.  That means that all vendors must support
>it.  And must support it to all output devices (not just bright terminals).
>That means printers with hammers which are limited to the ASCII 95 graphic
>characters.  

That is not an argument. If that were true, we should immediately do away
with the lowercase letters. There are printers that don't know them and
leave spaces where the text should have been. And, yes, Text_io allows you
to send any ASCII character. (Or since when did Put(ASCII.ETX) become illegal?)

More to the point, what the output device does with the bits sent to it is
beyond the scope of the language definition. Or else all vendors will
be in big trouble. How can they ensure that ASCII.L_BRACKET always
turns up as a left bracket? Sent to a Swedish terminal you are likely to
get a capital A with dots over it.

>The only reasonable way of requiring vendors to support
>something more than the 95 characters plus ASCII.HT would be to make
>Text_IO generic.  This brings its own problems:  there is currently no
>
> ...Long discussion on character generic and cost/benefit

The very easy solution is to remove the restriction that character and
string literals may consist only of printable characters. A very cheap
modification, even in 137 compilers. Also, the language must explicitly 
allow character codes up to 255, which several compilers already seem
to do. (Anyone know of one that doesn't?) These two changes are
the minimum, although the language definition would be cleaner with
some packages defining names for all characters. (You need more than
one, since there exists, or will exist, more than one ISO standard
in parallel.)

I could stop here, but let me continue to a definitely more costly
solution. Ed Colbert talked about the virtue of data abstraction
when advocating the idea of writing a new Text_io. But the character
type is as concrete as it can be. Character comparisons based on the
codes used for communication are ridiculous when you think of it. It
happens to give the correct result for English, but that is the only one.
(With Ada it doesn't really matter, since it seems to disallow other
languages. :-) What I would like to see is language-dependent comparison
operators, the language being selected somewhere: in the OS, by a pragma,
or dynamically.
  This last idea is of course not unique to Ada, but general to
all programming languages.
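  Just to show the flavour of what I mean, here is a toy over plain
Character (the names are made up; a real version would of course put
the national letters in their proper places):

  package Language_compare is
     type Rank is range 0 .. 255;
     Ranks : array (Character) of Rank;    -- filled in below
     function Less(Left, Right : String) return Boolean;
  end Language_compare;

  package body Language_compare is

     -- Compare according to the rank table instead of the raw codes.
     function Less(Left, Right : String) return Boolean is
        I : Integer := Left'first;
        J : Integer := Right'first;
     begin
        while I <= Left'last and then J <= Right'last loop
           if Ranks(Left(I)) /= Ranks(Right(J)) then
              return Ranks(Left(I)) < Ranks(Right(J));
           end if;
           I := I + 1;
           J := J + 1;
        end loop;
        return Left'length < Right'length;
     end Less;

  begin
     for C in Character loop        -- start from the plain ASCII order ...
        Ranks(C) := Rank(Character'pos(C));
     end loop;
     for C in Character range 'a' .. 'z' loop
        Ranks(C) := Ranks(Character'val(Character'pos(C) - 32));  -- ... e.g. fold case
     end loop;
  end Language_compare;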
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP