[comp.lang.c] Long Chars

TLIMONCE%DREW.BITNET@CUNYVM.CUNY.EDU (03/13/88)

(This discussion is looking for a good pun... like "lawn chairs"?  No... I
said GOOD pun.)

The "short char vs char" problem can't be solved very easily.  Why not a
"long char".  That wouldn't break much code now, would it?  Now I'm not
demanding that it goes into v1.0 of the standard but maybe we can look at
this for the next "congress".

For now, if you want to make some progress, try to get one of the biggies
(like MS) to add it as an extension.  You can tell them that they'll hit
on the "multi-nation/multi-language vendor market" with it.

Of course, in my programming I don't have a use for it, but if you do, try

typedef short LONG_CHAR;
or
typedef char LONG_CHAR[2];
(Hmmm... I like the former)

and then you can implement a lstrcmp() and a lstrcpy() and an assortment
of routines like that.  Then when you're done, those can be re-used in all
your programs.
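
For instance, a rough sketch of what two such routines might look like with
the first typedef (names and behavior purely illustrative, not any vendor's
actual interface):

typedef short LONG_CHAR;        /* one "wide" text unit */

/* Compare two LONG_CHAR strings terminated by a 0 value;
   returns <0, 0, or >0, like strcmp(). */
int lstrcmp(const LONG_CHAR *a, const LONG_CHAR *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (int)*a - (int)*b;
}

/* Copy the LONG_CHAR string src, including its 0 terminator,
   into dst; returns dst, like strcpy(). */
LONG_CHAR *lstrcpy(LONG_CHAR *dst, const LONG_CHAR *src)
{
    LONG_CHAR *p = dst;

    while ((*p++ = *src++) != 0)
        ;
    return dst;
}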

When it gets suggested to ANSI C II (or whatever it'll be called), you'll
be there to warn us about implementation difficulties and ideas.  And when
it gets passed, you can do a search-and-replace from "LONG_CHAR" to "long
char".

"And there was much rejoicing" -- Monty Python


Tom Limoncelli | Drew U/Box 1060/Madison NJ 07940 | tlimonce@drew.BITNET
    Disclaimer: These are my views, not those of my employer or Drew University
--------------------------------------------------------------------------

gwyn@brl-smoke.ARPA (Doug Gwyn ) (03/13/88)

In article <12341@brl-adm.ARPA> TLIMONCE%DREW.BITNET@CUNYVM.CUNY.EDU writes:
>The "short char vs char" problem can't be solved very easily.  Why not a
>"long char".

This was basically what the Japanese originally requested.  The main
drawback is that any code that handles text characters (i.e. most
applications!) would have to be changed to use long-chars in order to
work in an international environment, and there would have to be
long-char versions of the usual string handling functions.  The
short-char proposal does not suffer from this drawback because a
char is already the right size to hold a text unit.  Its only
problem is that a fair amount of code has been written dependent on
the assumption that sizeof(char)==1, although some programmers have
been careful not to assume that all along.

kennedy@tolerant.UUCP (Bill Kennedy) (03/15/88)

In article <12341@brl-adm.ARPA> TLIMONCE%DREW.BITNET@CUNYVM.CUNY.EDU writes:
>[ pun reference omitted ]
>
>The "short char vs char" problem can't be solved very easily.  Why not a
>"long char".  That wouldn't break much code now, would it?  Now I'm not
>demanding that it goes into v1.0 of the standard but maybe we can look at
>this for the next "congress".

There are already specifications for it; AT&T has one, and I think I read
something from HP about it as well.  It can be solved rather easily, and
it need not break much code if the code is well written.  The same old
dragon that breathed up the pointer/int thing just rears its ugly head
again for characters.

>For now, if you want to make some progress, try to get one of the biggies
>(like MS) to add it as an extension.  You can tell them that they'll hit
>on the "multi-nation/multi-language vendor market" with it.

I disagree.  I am using long characters for a specific purpose, and adding
the baggage to domestic computing wouldn't serve any useful purpose.  I
don't think you will get a software vendor to weave it in if it costs
performance at compile or run time (and it costs both...).  The hardware
manufacturers will implement it themselves if they want to penetrate
farther into the overseas markets.  Remember, it's not just a world of 7
or 15 bit characters; variations on the Roman alphabet (e.g. the European
languages) are handled with the eighth bit (which has its own problems too,
not pertinent here).  I don't think you will get any momentum at all from
software houses, but I have first-hand knowledge :-) that the computer
manufacturers get pretty interested.

>Of course, in my programming I don't have a use for it, but if you do, try
>
>typedef short LONG_CHAR;
>or
>typedef char LONG_CHAR[2];
>(Hmmm... I like the former)

No offense intended, but I wholeheartedly agree with "don't have a use..."
and would suggest it read "haven't had any experience with...".  I'm
also not scolding you; I work with the things every day and there are some
very real traps.  If you just make it a typedef you'll get your storage
sizes right (for the most part), but you can't manipulate either of your
examples very well.  I use lchar because it's easier to type than LONG_CHAR.
You need a further refinement so that you can look at each byte and the
bits within each byte; I use a structure and a union within that.
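
Something along these lines, as a sketch only (the field names are made up,
and which byte comes "first" is machine-dependent):

typedef union {
    short whole;                /* the character as one 16-bit value */
    struct {
        unsigned char hi;       /* the individual bytes...           */
        unsigned char lo;       /* ...for MSbit and flag testing     */
    } byte;
} lchar;

/* e.g. test the MSbit of the first byte of a long character */
#define LCHAR_MSBIT(c)  (((c).byte.hi & 0x80) != 0)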

>and then you can implement a lstrcmp() and a lstrcpy() and an assortment
>of routines like that.  Then when you're done, those can be re-used in all
>your programs.

You also need routines to convert into and out of strings containing long
characters, and some way to insulate yourself from case statements and
while(c) loops that make assumptions about character size and content.  To
justify the long character structure/union approach: vi, the shell, and I'm
sure other programs use the MSbit of a character for their own purposes.
Many Asian terminals set the MSbit of a byte as a flag that another byte is
coming with the rest of the character.  In some European countries it's
quite normal for the MSbit to be set for a special character native to
their alphabet but absent from ASCII.  So here you see but three uses of
the MSbit that are darned near mutually exclusive and require further
inspection of the byte stream.
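
To make the terminal case concrete, here is a rough sketch of turning such a
byte stream into long characters, using the lchar union sketched above (error
handling and malformed input are ignored, and the encoding rule is only the
one described here):

/* Convert a byte string in which a set MSbit means "another byte
   follows" into an array of lchar; returns the count produced. */
int btolstr(lchar *out, const unsigned char *in)
{
    int n = 0;

    while (*in != '\0') {
        if (*in & 0x80) {               /* two-byte character    */
            out[n].byte.hi = *in++;
            out[n].byte.lo = *in++;
        } else {                        /* plain 7-bit character */
            out[n].byte.hi = 0;
            out[n].byte.lo = *in++;
        }
        n++;
    }
    out[n].whole = 0;                   /* terminate the result  */
    return n;
}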

>When it get's suggested to ANSI C II (or whatever it'll be called) you'll
>be there to warn us about implementation difficulties and ideas.  And when
>it gets passed you can do a search-and-replace from "LONG_CHAR" to "long
>char"

I'm not convinced that it belongs in the language specification because it
is so implementation specific.  In fact I'm not sure that it even needs to
exist for hardware destined for a technical audience.  Those professionals
have learned to read ASCII like some of us did APL :-)  When you start to
bring in commercial applications where you want to drive down the level of
skill required to operate a program, that's where you need the additional
capability/overhead.  You made a good start and now I have overkilled it for
you...

These are my opinions and observations; Tolerant is nice enough to let me
use their equipment, so don't blame them for me.

Bill Kennedy {rutgers,cbosgd,killer}!ssbn!bill or bill@ssbn.WLK.COM

karl@haddock.ISC.COM (Karl Heuer) (03/15/88)

In article <12341@brl-adm.ARPA> TLIMONCE%DREW.BITNET@CUNYVM.CUNY.EDU writes:
>[Until "long char" gets added, probably in the 2nd standard, try]
>typedef short LONG_CHAR; [and add a bunch of library functions].

Something equivalent is in fact in the current standard; they called it
wchar_t (wide character type).  I've only just gotten hold of a dpANS recent
enough to include this, and haven't finished reading it, but my impression is
that only the type, the corresponding constants (L'x' for wchar_t, L"x" for
wchar_t[]), and a few utility functions are being added for this standard.
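
The visible part is small; assuming an implementation whose <stddef.h>
already defines wchar_t, it looks about like this:

#include <stddef.h>                 /* dpANS: wchar_t is declared here  */

wchar_t wc   = L'x';                /* wide character constant          */
wchar_t ws[] = L"wide";             /* wide string, of type wchar_t[]   */

/* Note that sizeof(wchar_t) is implementation-defined; it need not be 1. */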

The real problem is that "char" has been inappropriately overloaded.  We need
to distinguish between a text character (wchar_t or long char), a small
integer (short short int), and a quantum of memory (byte_t or short char).
Ideally, all three of these should have names other than "char", and the type
"char" should be deprecated.  Unfortunately, there's so much inertia to
overcome that this will probably never be fixed in C.  Fix it in "D"...

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

jay@splut.UUCP (Jay Maynard) (03/22/88)

From article <7447@brl-smoke.ARPA>, by gwyn@brl-smoke.ARPA (Doug Gwyn ):
> [...] Its only problem is that a fair amount of code has been
> written dependent on the assumption that sizeof(char)==1, although
> some programmers have been careful not to assume that all along. 

My initial knee-jerk reaction to this was "Hey, wait a minute!
sizeof(char) is defined to be 1!" Before I made a fool of myself on the
net (wouldn't be the first time...:-), though, I picked up the copy of
K&R that my recent desk-cleaning revealed. Sure enough, section 7.2, at
the top of page 188 in my copy, says:
"A _byte_ is undefined in the language except in terms of the value of
sizeof. However, IN ALL EXISTING IMPLEMENTATIONS a byte is the space
required to hold a char." (Emphasis added.)

I don't know how much existing code this would break (though I'd bet
there would be quite a bit of it). It does mean that I, too, will be
careful not to make that assumption...

-- 
Jay Maynard, EMT-P, K5ZC...>splut!< | GEnie: JAYMAYNARD  CI$: 71036,1603
uucp: {uunet!nuchat,academ!uhnix1,{ihnp4,bellcore,killer}!tness1}!splut!jay
Never ascribe to malice that which can adequately be explained by stupidity.
The opinions herein are shared by none of my cats, much less anyone else.

flaps@dgp.toronto.edu (Alan J Rosenthal) (03/25/88)

*sigh*... it took me a full year from the start of my C career to decide
finally that sizeof(char) really was guaranteed to be 1, due to the
constraint that all objects are made up of chars (i.e. a char * can
traverse any object), recently formalized by ANSI and previously
established informally by the existence of memcpy() / bcopy() and friends.
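
You can see that constraint in the way any portable copy routine gets
written (a sketch of the familiar idiom, not any particular library's code):

#include <stddef.h>

/* Copy n storage units from src to dst.  That this works for objects
   of *any* type is what pins sizeof(char) to 1: a char * has to be
   able to step through every object one unit at a time. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    while (n-- > 0)
        *d++ = *s++;
    return dst;
}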

Why do you need to make sizeof(char) == 2 just to make chars 16 bits?
Make chars 16 bits, keep sizeof(char) == 1, also make sizeof(int) == 1
and sizeof(long) == 2, etc.  If ANSI requires plain char to be signed in
all implementations in which sizeof(char) == sizeof(int), we're all set.

ajr
-- 
If you had eternal life, would you be able to say all the integers?

gwyn@brl-smoke.ARPA (Doug Gwyn ) (03/26/88)

In article <8803250401.AA01184@champlain.dgp.toronto.edu> flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
>Why do you need to make sizeof(char) == 2 just to make chars 16 bits?
>Make chars 16 bits, keep sizeof(char) == 1, ...

The idea is that you not only need to handle fat chars, you also
have applications that need to handle smaller objects (bytes, or
bits).  Therefore there would have to be some object type smaller
than a char (e.g. a "short char").

hermit@shockeye.UUCP (Mark Buda) (03/26/88)

In article <439@splut.UUCP> jay@splut.UUCP (Jay Maynard) writes:
>"A _byte_ is undefined in the language except in terms of the value of
>sizeof. However, IN ALL EXISTING IMPLEMENTATIONS a byte is the space
>required to hold a char." (Emphasis added.)

Therefore, if your compiler says that sizeof(char) != 1, it clearly does
not exist.

--
Mark Buda			Smart UUCP: hermit@chessene.uucp
Dumb UUCP: ...{rutgers,ihnp4,cbosgd}!bpa!vu-vlsi!devon!chessene!hermit
"One look at you, sir, is proof that anything is possible."

nevin1@ihlpf.ATT.COM (00704a-Liber) (03/30/88)

In article <7546@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <8803250401.AA01184@champlain.dgp.toronto.edu> flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
>>Why do you need to make sizeof(char) == 2 just to make chars 16 bits?
>>Make chars 16 bits, keep sizeof(char) == 1, ...

>The idea is that you not only need to handle fat chars, you also
>have applications that need to handle smaller objects (bytes, or
>bits).  Therefore there would have to be some object type smaller
>than a char (e.g. a "short char").

This makes me think that way back when K&R defined C, they should have
called the 'char' type a 'byte' type instead.  Because of existing
practice (whether it be good or bad, it is common), I feel that
sizeof(char) should stay 1.  70% of the time that I use char, I use it for
byte-type operations (reading in from a file, etc.).
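
That byte-flavored use is just the everyday file-reading idiom; nothing in
it treats the chars as text (a sketch only):

#include <stdio.h>

/* Count the bytes in an open file by reading it through a char buffer.
   Here char means "smallest unit of storage", not "text character". */
long count_bytes(FILE *fp)
{
    char buf[512];
    size_t n;
    long total = 0;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;
    return total;
}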

There is a need for having a fundamental type (call it foo) such that
sizeof(foo) == 1 can be guaranteed in *ALL* implementations.  Due to
existing practice, I would like that type to be called char.  Just add
things like 'long char' to accommodate the people who need them.
-- 
 _ __			NEVIN J. LIBER	..!ihnp4!ihlpf!nevin1	(312) 510-6194
' )  )				"The secret compartment of my ring I fill
 /  / _ , __o  ____		 with an Underdog super-energy pill."
/  (_</_\/ <__/ / <_	These are solely MY opinions, not AT&T's, blah blah blah

gwyn@brl-smoke.ARPA (Doug Gwyn ) (03/31/88)

In article <4191@ihlpf.ATT.COM> nevin1@ihlpf.UUCP (00704a-Liber,N.J.) writes:
>There is a need for having a fundamental type (call it foo) such that
>sizeof(foo) == 1 can be guaranteed in *ALL* implementations.  Due to
>existing practice, I would like that type to be called char.  Just add
>things like 'long char' to accomodate the people who need them.

sizeof(bit)==1 can be guaranteed universally.

If you mean addressable object, there is no single size universally
supported by computer hardware.

The problem with preempting "char" for small objects is that most C
code thinks that a "char" is big enough to hold a primitive unit of
text.  This is plainly wrong in some environments unless "char" is
made pretty large.  (It needs to be 16 bits for Imagen's GASCII, for
example.)  "char" cannot play both roles at once, and "long char" is
contrary to the current use of "char" majority of existing code (as
well as requiring a whole slew of lstr*() library functions).

karl@haddock.ISC.COM (Karl Heuer) (03/31/88)

In article <4191@ihlpf.ATT.COM> nevin1@ihlpf.UUCP (00704a-Liber,N.J.) writes:
>There is a need for having a fundamental type (call it foo) such that
>sizeof(foo) == 1 can be guaranteed in *ALL* implementations.  Due to
>existing practice, I would like that type to be called char.  Just add
>things like 'long char' to accomodate the people who need them.

The problem is that there are three distinct types of objects (small integers,
allocation quanta, and characters), all of which have traditionally been
called "char".  We can't keep existing practice on all three, and still have
useful programs in large-alphabet environments.

The current dpANS still equates the first two, but has created wchar_t for the
third.  I'm seriously considering adopting a convention that eschews all use
of the word "char" (much as some people avoid "int") in favor of a good set of
typedefs.  (Certainly I'd change this for "D".)
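
Such a convention might look roughly like this (the names are invented; the
only point is that each of the three roles gets a name of its own):

#include <stddef.h>                     /* dpANS: wchar_t */

typedef unsigned char   byte_t;         /* allocation quantum / raw storage */
typedef signed char     tiny_t;         /* small signed integer             */
typedef wchar_t         text_t;         /* one primitive unit of text       */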

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

nevin1@ihlpf.ATT.COM (00704a-Liber) (03/31/88)

In article <7586@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:

>The problem with preempting "char" for small objects is that most C
>code thinks that a "char" is big enough to hold a primitive unit of
>text.  This is plainly wrong in some environments unless "char" is
>made pretty large.

C code *should* think that a "char" is big enough to hold a primitive unit
of text.  That's because K&R (1st edition) said (section 4, paragraph 4 of
the C reference manual):

"Objects declared as characters (char) are large enough to store any member
of the implementation's character set, ..."

Currently (pre-dpANS), if this is not true, then the language being
implemented is not K&R C (although, I'll admit, it's probably pretty close
:-)).


I do agree with you that right now "char" has too many uses and there is no
easy way to separate them due to the volume of existing code that uses
"char"s in different ways (assuming that I am not mis-paraphrasing you; if
I am, I'm sorry).
-- 
 _ __			NEVIN J. LIBER	..!ihnp4!ihlpf!nevin1	(312) 510-6194
' )  )				"The secret compartment of my ring I fill
 /  / _ , __o  ____		 with an Underdog super-energy pill."
/  (_</_\/ <__/ / <_	These are solely MY opinions, not AT&T's, blah blah blah

karl@haddock.ISC.COM (Karl Heuer) (04/01/88)

In article <4216@ihlpf.ATT.COM> nevin1@ihlpf.UUCP (00704a-Liber,N.J.) writes:
|In article <7586@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
|>The problem with preempting "char" for small objects is that most C
|>code thinks that a "char" is big enough to hold a primitive unit of
|>text.  This is plainly wrong in some environments unless "char" is
|>made pretty large.
|
|[But K&R says] "Objects declared as characters (char) are large enough to
|store any member of the implementation's character set, ..."

Ah, but a "primitive unit of text" need not be in "the implementation's
character set".  In particular, the latter can be an 8-bit superset of ASCII
which implements some Natural Language characters with two-byte codes.
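
In such an encoding the number of text units and the number of chars in a
string differ, which is exactly where strlen()-style code goes wrong.  A
sketch, assuming purely for illustration that a byte with the MSbit set
introduces a two-byte code and that the input is well formed:

#include <stddef.h>

/* Count primitive units of text, not bytes. */
size_t textlen(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        s += (*s & 0x80) ? 2 : 1;       /* skip one byte or two */
        n++;
    }
    return n;
}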

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

flaps@dgp.toronto.edu (Alan J Rosenthal) (04/05/88)

I, flaps@dgp.toronto.edu (Alan J Rosenthal) wrote:

>>Why do you need to make sizeof(char) == 2 just to make chars 16 bits?
>>Make chars 16 bits, keep sizeof(char) == 1, make sizeof(int) == 1, ...

gwyn@brl.arpa (Doug Gwyn) responded:

>The idea is that you not only need to handle fat chars, you also
>have applications that need to handle smaller objects (bytes, or
>bits).  Therefore there would have to be some object type smaller
>than a char (e.g. a "short char").

I now respond:

First of all, why would you possibly want to access bytes?  Bytes are
machine-dependent things with no high-level analogue.  You certainly
might want to access some object which is small enough to use for
traversing an arbitrary object.  16-bit chars would still have this
property so long as all objects were a multiple of 16 bits long.

As for being able to access bits, sizeof(char) would have to be 8 or 16
for that, not just 2.  Also, creating objects smaller than chars would
cause a lot of other problems, such as requiring either the introduction
of the concept of alignment into the language (e.g. the arguments to
memcpy must be char-aligned) or making the arguments to routines like
memcpy be pointers to this smaller object, with all the expense that that
incurs.

ajr
-- 
"Comment, Spock?"
"Very bad poetry, Captain."