[comp.lang.c] Hex escapes in strings

dant@tekla.TEK.COM (Dan Tilque;1893;92-789;LP=A;60aC) (01/09/88)

A question occurred to me about hex escapes.  Where does the padding nybble
go when an odd number of hex digits are in the escaped string?  For example:

	"\x1A2B3 example"

The escaped constant has 5 hex digits which fit into 2.5 bytes.  Some byte
has to be padded with (I assume) null bits.  Which byte is it: the initial
or the trailing?  Does the proposed standard say?  This is something that
was not needed with octal escapes since they always fit into one byte.

Perhaps the standard should be changed to require an even number of hex
digits in an escaped string.

---
Dan Tilque

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/11/88)

In article <2938@zeus.TEK.COM> dant@tekla (Dan Tilque) writes:
>A question occurred to me about hex escapes.  Where does the padding nybble
>go when an odd number of hex digits are in the escaped string?  For example:
>	"\x1A2B3 example"
>The escaped constant has 5 hex digits which fit into 2.5 bytes.

No!  First, it is not specified how big a char (byte) must be, except
that it must be AT LEAST 8 bits.  It could be much larger, although
16 bits is the largest I expect to find in any C implementation in
the near future.  Next, the escape \x1A2B3 represents a SINGLE char,
not a sequence of chars.  If the number is too large to fit in a
single char, as it would be for chars through 16 bits in size, then
how it is interpreted (still as a SINGLE char) is implementation-
defined.  Generally, excess high-order bits are discarded, although
that is not required.

The corresponding character literal '\x1A2B3' is, as always, an
int (NOT a char, no matter how small the value).  Again, if the
value doesn't fit within a SINGLE char, the interpretation is up
to the implementation.  Generally, the overflow bits are used to
assemble additional char subfields within the int, but again that
is not required.  Portable C programming requires that one not
use such over-long character literals.
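
A minimal sketch of the two points above, for C rather than C++ (where
character literals have type char).  The function names are made up for
illustration, and the escape values are deliberately kept small enough to
fit in an 8-bit char so that nothing implementation-defined is involved:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* In C, a character literal has type int, however small its value. */
size_t char_literal_size(void)
{
    return sizeof('\x41');
}

/* A hex escape in a string literal is one char, no matter how many
   hex digits were written: "\x041" is the char 0x41 plus the NUL. */
size_t escape_string_size(void)
{
    return sizeof("\x041");
}

/* Self-check of the claims made in the comments above. */
void check_sizes(void)
{
    assert(CHAR_BIT >= 8);                       /* a char is at least 8 bits */
    assert(char_literal_size() == sizeof(int));  /* literal is an int         */
    assert(escape_string_size() == 2);           /* one char plus the NUL     */
}
```

Calling check_sizes() from a test program should pass silently on any
conforming C compiler; the three hex digits in "\x041" do not turn into
1.5 bytes.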

Note that long hex escapes are intended for non-portable usage,
primarily in multi-byte character set environments, although they
are useful on unusual architectures having chars > 8 bits.

ray@micomvax.UUCP (Ray Dunn) (01/19/88)

In article <7021@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>
>Note that long hex escapes are intended for non-portable usage,
>primarily in multi-byte character set environments, although they
>are useful on unusual architectures having chars > 8 bits.
>

I was going to try to make this response witty, but it's too late in the day.

This, Doug, seems to me either idiocy or arrogance; someone please tell me
which.  Or is the above statement included in the Semantics section of the
description of hex escapes in string constants, so that I can ensure I'm
using them as the committee intended?

The unfortunate thing is, string constants are not just used for messages
etc., but also as arrays of 8-bit data.  These often contain hex constants,
and those hex constants are often followed by hex digits.  So not only do
we have a major problem of breaking existing code, but also another example
of the verbosity being added to C (having to concatenate strings to avoid
the hex problem).  We can write "ABC\x12H... but must remember to write
"ABC\x12""F...  Great!

What is the expression about being committee'd to death?

Ray Dunn.   ..philabs!micomvax!ray
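
For readers following along, here is a small sketch of the behaviour being
complained about.  The array and function names are invented for this
example; the values are chosen to fit in an 8-bit char.  It relies on two
facts: a hex escape runs on for as long as hex digits follow it, and
adjacent string literals are concatenated only after their escapes have
been converted.  The octal form is shown as an alternative, since an octal
escape never takes more than three digits.

```c
#include <assert.h>
#include <string.h>

/* Split so that 'F' is not absorbed into the escape: A B C 0x12 F NUL. */
const char split_hex[] = "ABC\x12" "F";

/* 'H' is not a hex digit, so it ends the escape by itself. */
const char hex_then_h[] = "ABC\x12H";

/* Octal escapes stop after three digits: \022 is 0x12, then a plain 'F'. */
const char octal_form[] = "ABC\022F";

/* Unsplit, the escape swallows the F: A B C 0x1F NUL. */
const char swallowed[] = "ABC\x1F";

/* Self-check of the layouts described in the comments above. */
void check_escapes(void)
{
    assert(sizeof split_hex == 6);
    assert(split_hex[3] == 0x12 && split_hex[4] == 'F');
    assert(sizeof hex_then_h == 6 && hex_then_h[4] == 'H');
    assert(memcmp(split_hex, octal_form, sizeof split_hex) == 0);
    assert(sizeof swallowed == 5 && swallowed[3] == 0x1F);
}
```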

rbutterworth@watmath.waterloo.edu (Ray Butterworth) (02/02/88)

I too think that allowing arbitrarily long hex strings is not
a good idea, though not for the same reason.

Something that I've always felt should be in the standard printf
functions is allowing the "#" qualifier for %c and %s formats.

"%#c" would print the character as an appropriate escape sequence
if it wasn't a printing character.
(hex vs. octal would be implementation dependent)
e.g. \n, \f, \001, \xff.

This would make error messages a lot easier to read.
Typically they now say either "Illegal character '%c'\n", or
"Illegal character %#o\n".  The first is very ugly and unreadable
for non-printing characters, the second is unreadable for printing
characters if you haven't memorized the ASCII codes for '*', '}',
etc.  Using "%#c" would give the correct version every time.

Similarly, "%#s" would perform this same expansion on strings.
	printf("%#s\n", "one\ntwo");
would produce a single 8-character line of output: "one\ntwo".

If ANSI allows arbitrarily long hex sequences, this scheme
cannot be implemented, since there is no way to indicate
the end of a hex sequence, and so "%#s" output could be
ambiguous.

The previously mentioned suggestion of a null-escape, e.g. "\z",
that did nothing, would solve the problem.
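
The "%#s" format proposed above is not part of printf, so the following is
only a sketch of what such an expansion might do.  escape_string is a
hypothetical helper written for this example, and it assumes 8-bit chars.
It sidesteps the ambiguity by always emitting exactly three octal digits
for non-printing characters, which a reader can delimit unambiguously; a
hex form such as \x12 followed by a literal F would read back as \x12F.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write src into dst as C-style escaped text.
   dst must have room for 4 * strlen(src) + 1 bytes.  Returns dst. */
char *escape_string(char *dst, const char *src)
{
    char *out = dst;

    assert(dst != NULL && src != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        switch (c) {
        case '\n': *out++ = '\\'; *out++ = 'n';  break;
        case '\t': *out++ = '\\'; *out++ = 't';  break;
        case '\f': *out++ = '\\'; *out++ = 'f';  break;
        case '\\': *out++ = '\\'; *out++ = '\\'; break;
        default:
            if (isprint(c))
                *out++ = (char)c;
            else  /* fixed-width octal, so a following digit is never absorbed */
                out += sprintf(out, "\\%03o", (unsigned)c);
        }
    }
    *out = '\0';
    return dst;
}
```

With this helper, the string "one\ntwo" expands to the 8 characters
one\ntwo, and the byte 0x12 followed by F expands to \022F.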