[comp.std.c] How to write Trigraph like character sequences in a string

martin@mwtech.UUCP (Martin Weitzel) (05/30/91)

In article <RICHARD.91May29092137@lambda.iesd.auc.dk> richard@iesd.auc.dk (Richard Flamsholt S0rensen) writes:
>>>>>> On 28 May 91 23:12:53 GMT, bliss@sp64.csrd.uiuc.edu (Brian Bliss) said:

>> What if I want to use the sequence "??!" within a string?

>  puts("Too bad - it is impossible to use ??""! in a string  :-)");

puts("In fact, there is another impossible way to use ?\?! in a string  :-)");

puts("and what about 1) \??!");
puts("               2) ?""?!");
puts("               3) ?""?""!");
puts("               4) ??\!");

If you want to think a moment about which of the above could also work,
you now have the time (if your news-reader stops output on a formfeed).

I'm sure that 1) DOESN'T work (some books about ANSI-C are not aware
of this - they obviously didn't care about the phases of translation).
2) and 3) should work OK, where 3) is a lot to type and will usually
fall out of consideration; I'm not quite sure about 4) and I'm too lazy
to RTFAS (read the fine ANSI standard), since four ways to make the
impossible possible seem already to be enough.
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83
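
A minimal sketch of the two spellings that the thread agrees are safe,
assuming a C89 or later compiler. The function names are hypothetical and
do not come from any post. Neither spelling puts two question marks next to
each other in the source, so both yield the characters ? ? ! whether or not
the compiler performs trigraph replacement.

```c
#include <string.h>

/* Spelling 2 from the post: the literals are joined in translation
   phase 6, long after trigraph replacement has run in phase 1. */
const char *via_concatenation(void)
{
    return "?" "?!";
}

/* Spelling 5: \? is a standard escape sequence for a question mark. */
const char *via_escape(void)
{
    return "?\?!";
}

/* Reference value built from character constants, so no string
   literal in this function can be mistaken for a trigraph. */
const char *reference(void)
{
    static const char text[] = { '?', '?', '!', '\0' };
    return text;
}

/* Returns 1 when both spellings produce the reference string. */
int spellings_agree(void)
{
    return strcmp(via_concatenation(), reference()) == 0
        && strcmp(via_escape(), reference()) == 0;
}
```

Passing either result to puts() prints the sequence the original poster
asked for.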

diamond@jit533.swstokyo.dec.com (Norman Diamond) (05/31/91)

In article <1157@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
>In article <RICHARD.91May29092137@lambda.iesd.auc.dk> richard@iesd.auc.dk (Richard Flamsholt S0rensen) writes:
>>  puts("Too bad - it is impossible to use ??""! in a string  :-)");
>puts("In fact, there is another impossible way to use ?\?! in a string  :-)");
WARNING!  NOT:  >puts("and what about 1) \??!");  [as Mr. Weitzel said]
>puts("               2) ?""?!");
>puts("               3) ?""?""!");
>puts("               4) ??\!");

>1) DOESN'T work (some books about ANSI-C are not aware of this

Uh, thank you, some books that falsely claim to be about ANSI-C...
Anyone care to name names, so that readers can be warned to avoid them?

>2) and 3) should work OK, [...] I'm not quite sure about 4)

It works.
I have seen recommendations for:
puts("               5) ?\?!");
but maybe I like 4) better than 5).  Furthermore, 4) and 5) work at
preprocessing time, when string concatenation has not been done yet.

Just to be perverse, here are two more:
puts("               6) ???/?!");
puts("               7) ????/!");
I lied; these are not just to be perverse.  If you need trigraphs, then
you need these.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.

minar@reed.edu (05/31/91)

>>> What if I want to use the sequence "??!" within a string?

easy. Use a compiler that is fully ANSI, but lets you turn off
trigraphs.  At least for compilers in America, that's more or less the
norm.. the Gnu compiler specifically requires you to turn trigraphs ON
(I think - that might have changed) and Borland doesn't even have
trigraphs in its compiler anymore, there's a program called
trigraph.com that does the translation for you, so just don't execute
it! Oh, I guess it's not conforming code, but does anyone really use
trigraphs? Really, anyone? If I had a keyboard that wasn't fully C
capable, I'd certainly use something, but it wouldn't be trigraphs,
unless I had to ship the code elsewhere..

mycroft@goldman.gnu.ai.mit.edu (Charles Hannum) (05/31/91)

In article <m0jizgb-0001OzC@dharma.reed.edu> minar@reed.edu writes:

   the Gnu compiler specifically requires you to turn trigraphs ON
   (I think - that might have changed)

Not if you use '-ansi -pedantic' by default.  B-)  (Okay!  So I'm a pedant!)

   and Borland doesn't even have trigraphs in its compiler anymore,
   there's a program called trigraph.com that does the translation
   for you, so just don't execute it!

This is sloppy programming on Borland's part.  (Not surprising!  Borland has
been going steadily downhill for a while now.)

   Oh, I guess it's not conforming code, but does anyone really use
   trigraphs? Really, anyone?

Yes.  I, for one, used them rather recently on an IBM 3090, with WATCOM C/370
(which purports to be ANSI-compliant).  The main reason for this is that IBM
3178 and 3179G terminals have no keys for square brackets.  As a C programmer,
I consider this extremely annoying.  B-/

steven@pacific.csl.uiuc.edu (Steven Parkes) (05/31/91)

|> >puts("               4) ??\!");

|> >2) and 3) should work OK, [...] I'm not quite sure about 4)

|> It works.

(4) is not guaranteed to work because '\!' is not a valid ANSI escape.  The
result of using '\!' (or any other invalid escape sequence) is undefined.
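
As a rough illustration of that point, the sketch below checks whether a
character may follow a backslash in one of the standard simple escape
sequences. The helper name is_simple_escape is made up for this example,
and octal and hexadecimal escapes are deliberately left out.

```c
#include <string.h>

/* Returns 1 if a backslash followed by c is one of the simple escape
   sequences defined by the standard, and 0 otherwise. */
int is_simple_escape(char c)
{
    /* strchr also matches the terminating null, so reject it first. */
    if (c == '\0')
        return 0;
    return strchr("'\"?\\abfnrtv", c) != NULL;
}
```

The question mark is in the list and the exclamation mark is not, which is
why spelling 5 is dependable and spelling 4 is not.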

exspes@gdr.bath.ac.uk (P E Smee) (05/31/91)

In article <m0jizgb-0001OzC@dharma.reed.edu> minar@reed.edu writes:
> Oh, I guess it's not conforming code, but does anyone really use
>trigraphs? Really, anyone? If I had a keyboard that wasn't fully C
>capable, I'd certainly use something, but it wouldn't be trigraphs,
>unless I had to ship the code elsewhere..

Well, if you are forced (as I was a year or two ago) to use C on big
IBM iron (3090 running VM/CMS) you find yourself more or less forced
into using trigraphs.  A large number of the characters which are
important to C {}[]| and probably a few I've forgotten are not,
pragmatically, useable, because the EBCDIC codes for them are not well
defined.  (There are something like 9 possible 8-bit values for each of
these chars, which in some cases overlap -- i.e. [ in one encoding will
be the same as ] in another.  Different sub-models of 3270 keyboards
use different encodings, and different programs ditto.)

-- 
Paul Smee, Computing Service, University of Bristol, Bristol BS8 1UD, UK
 P.Smee@bristol.ac.uk - ..!uunet!ukc!bsmail!p.smee - Tel +44 272 303132

jfw@ksr.com (John F. Woods) (05/31/91)

mycroft@goldman.gnu.ai.mit.edu (Charles Hannum) writes:
>   and Borland doesn't even have trigraphs in its compiler anymore,
>   there's a program called trigraph.com that does the translation
>   for you, so just don't execute it!
>This is sloppy programming on Borland's part.

Why?  Does gcc have an option to do trigraph translation, and ONLY trigraph
translation, so that you can import a "portable character set" program into
your local character set so you don't have to waste the time preprocessing
trigraphs over and over again?  If so, why is it part of gcc?  Is it really
true that the Free Software Foundation is a secret conspiracy by memory
manufacturers?

gwyn@smoke.brl.mil (Doug Gwyn) (06/01/91)

In article <m0jizgb-0001OzC@dharma.reed.edu> minar@reed.edu writes:
>>>> What if I want to use the sequence "??!" within a string?
>easy. Use a compiler that is fully ANSI, but lets you turn off
>trigraphs.

Excuse me -- if a compiler variant does not support trigraphs then
it should not be called "ANSI" (as in "conforming to the ANSI C standard").
Standard conformance is a well-defined and useful property; let's not
introduce confusion about what the phrase "ANSI C" means by applying
it to deviant implementations.

That being said, I also have to take issue with the idea of relying
on a compiler "switch" for correct program semantics.  There are
standard ways to meet the requirement; no need to resort to such kludgery.
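
To make the fragility concrete, here is a small sketch (function names
invented for illustration) that contrasts a literal whose length depends on
whether trigraph replacement is active with one that does not. GCC, for
instance, documents a -trigraphs option that controls this; other compilers
may differ, and many will warn about the first literal.

```c
#include <stddef.h>

/* Number of characters in the literal, not counting the terminator.
   The result is 1 if the compiler replaced the trigraph with a
   vertical bar, and 3 if it left the question marks alone. */
size_t fragile_length(void)
{
    return sizeof("??!") - 1;
}

/* Same text spelled with the \? escape: always 3 characters. */
size_t robust_length(void)
{
    return sizeof("?\?!") - 1;
}
```

Only the second function means the same thing under every setting of the
switch.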

diamond@jit533.swstokyo.dec.com (Norman Diamond) (06/03/91)

Attribution lost (Mr. Weitzel, I think):
>>>puts("               4) ??\!");
>>>2) and 3) should work OK, [...] I'm not quite sure about 4)

My mistake:
>>It works.
[and another mixed pair, see below]

In article <1991May31.133330.1149@roundup.crhc.uiuc.edu> steven@pacific.csl.uiuc.edu writes:
>(4) is not guaranteed to work because '\!' is not a valid ANSI escape.  The
>result of using '\!' (or any other invalid escape sequence) is undefined.

Right, sorry.  And my other mistake,
puts("               6) ????/!");
is not guaranteed to work either, but
puts("               7) ???/?!");
is.  And if you need trigraphs in the first place, then (7) is the ONLY
dependable way of obtaining the originally requested character sequence.
(Though again, string concatenation still works if you don't need the
two-question-marks-and-a-bang at preprocessing time.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.
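
A rough model of the phase 1 replacement can help when checking by hand what
the two trigraph spellings turn into before the compiler looks at escape
sequences. replace_trigraphs is a hypothetical helper written for this
sketch, not a library routine; a real compiler does this to the source file,
not to strings at run time. The caller is assumed to supply an output buffer
at least as large as the input.

```c
#include <stddef.h>
#include <string.h>

/* Third characters of the nine trigraphs, and the single characters
   they stand for, at matching positions. */
static const char third[]  = "=(/)'<!>-";
static const char single[] = "#[\\]^{|}~";

/* Copies in to out, replacing each trigraph by its single character.
   Scanning is left to right and replaced text is never rescanned. */
void replace_trigraphs(const char *in, char *out)
{
    while (*in != '\0') {
        const char *hit = NULL;

        if (in[0] == '?' && in[1] == '?' && in[2] != '\0')
            hit = strchr(third, in[2]);

        if (hit != NULL) {
            *out++ = single[hit - third];
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Given the six characters ???/?! the helper produces ?\?! which the compiler
then reads as the standard \? escape. Given ????/! it produces ??\! which
contains the undefined \! escape. When calling it from C, the input is best
written with \? escapes so that the test source itself contains no trigraphs.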

gwyn@smoke.brl.mil (Doug Gwyn) (06/04/91)

In article <1991Jun3.011539.17430@tkou02.enet.dec.com> diamond@jit533.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>...  And if you need trigraphs in the first place, then (7) is the ONLY ...

Actually, one never NEEDS trigraphs; they're required to be supported as a
convenience when interchanging source code among sites or equipment with
poor support for the C source character set.  However, all conforming
implementations MUST support the full C source character set (as specified
in X3.159-1989 section 2.2.1) in addition to trigraph sequences (2.2.1.1).

sef@kithrup.COM (Sean Eric Fagan) (06/04/91)

In article <16332@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>However, all conforming
>implementations MUST support the full C source character set (as specified
>in X3.159-1989 section 2.2.1) in addition to trigraph sequences (2.2.1.1).

And the only "standard" way to *get* trigraphs is to use the 'c89' command.
Since ANSI C *can't* specify how to get full conformance, having something
like

	filter_trigraphs file.c | pre_processor | check_syntax | compile

is a perfectly "valid" way to get a conforming compiler.  Unless, as noted
previously, you're on a POSIX-compliant system, in which case you will be
able to use the aforementioned c89 command.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

diamond@jit533.swstokyo.dec.com (Norman Diamond) (06/05/91)

In article <16332@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Jun3.011539.17430@tkou02.enet.dec.com> diamond@jit533.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>...  And if you need trigraphs in the first place, then (7) is the ONLY ...
>
>Actually, one never NEEDS trigraphs; they're required to be supported as a
>convenience when interchanging source code among sites or equipment with
>poor support for the C source character set.

Well, if one has equipment with poor support for the C source character set,
then one either NEEDS trigraphs, or (worse) one needs to use a human-readable
and human-writable graphic mechanism that will be ad-hoc because the committee
refused to standardize one.  Anyway, clearly I was referring to cases with
equipment that does not support the C character set (and to cases where the
implementation is designed for such equipment, even if such equipment is not
used at that moment).

>However, all conforming
>implementations MUST support the full C source character set (as specified
>in X3.159-1989 section 2.2.1) in addition to trigraph sequences (2.2.1.1).

Does this mean that in a national character set that doesn't have [|\ etc.,
and a proposed implementation accepts the entire national character set
plus trigraphs, failure to support [|\ etc. through direct means without
trigraphs will make the implementation non-conforming?

I thought that the exact opposite of this was previously decided, that
trigraphs were sufficient in such cases.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.

gwyn@smoke.brl.mil (Doug Gwyn) (06/05/91)

In article <1991Jun5.005958.9597@tkou02.enet.dec.com> diamond@jit533.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>Does this mean that in a national character set that doesn't have [|\ etc.,
>and a proposed implementation accepts the entire national character set
>plus trigraphs, failure to support [|\ etc. through direct means without
>trigraphs will make the implementation non-conforming?

The C source character set, in terms of the standard, has no necessary
relation to the code set used for "the national character set", although
in many current implementations there happens to be a close relationship.
The standard says the specified set of C source characters, which are the
result of conversion from external representations, must be supported,
but it doesn't specify details of the conversion.  This is independent
of the trigraph mapping that occurs at a slightly later phase of translation.

You might recall the "Software Tools" Ratfor translator implementation;
it converted external source file characters, which might for example be
coded in EBCDIC, to a universal internal form (which happened to be ASCII
in that example), to take advantage of the known properties of the internal
representation (e.g. contiguity of the alphabetic character codes) for fast
processing, and converted back to external characters upon output (it was a
text-to-text format translator, not a true compiler).  C compiler
implementations [that don't need to provide support for Japanese-style
multibyte character encodings in C source files] could readily map whatever
the site-specific conventions are for external representation of funny
characters (traditionally displayed as vertical-bar glyph, etc.) used in
C source code to internal, more convenient (probably 7-bit code) form for
subsequent processing.  Such conventions are not the business of any C
standard; they necessarily depend on highly site-dependent characteristics.
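
The kind of mapping described above can be sketched as a small lookup from
external codes to internal C source characters. Everything in this example
is an assumption made for illustration: the struct, the function name, and
especially the three code values, which are invented and do not describe any
real terminal or code page.

```c
#include <stddef.h>

/* One site-specific entry: an external code and the C source
   character it stands for. */
struct external_mapping {
    unsigned char external;
    char internal;
};

/* Hypothetical site convention, invented for illustration only. */
static const struct external_mapping site_map[] = {
    { 0xAD, '[' },
    { 0xBD, ']' },
    { 0x4F, '|' }
};

/* Maps one external code to its internal form. Codes that are not
   listed in site_map pass through unchanged. */
char to_internal(unsigned char external)
{
    size_t i;

    for (i = 0; i < sizeof site_map / sizeof site_map[0]; i++) {
        if (site_map[i].external == external)
            return site_map[i].internal;
    }
    return (char)external;
}
```

A front end built this way would apply to_internal to every input character
before any other processing, so the rest of the compiler sees one uniform
representation.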

>I thought that the exact opposite of this was previously decided, that
>trigraphs were sufficient in such cases.

No, and trigraphs are a horrible invention that seems to mainly serve to
lead people in the wrong direction for coping with character set problems.
While trigraphs may at first appear to "solve" the code set issue by
permitting translation of any strictly conforming C source to a "lowest
common denominator" set of characters that even ISO 646 sites claim to
support, in order to accomplish this function one needs to come up with
utilities for translating generic C into full trigraphed form, as well as
the inter-code set text file transfer and translation facilities that are
always necessary for data interchange.  The latter have to solve the
differing-code-set issues anyway.  (Note that there are radically
different code sets among the sites that receive this newsgroup; to a
large extent similar problems have already been overcome in the
development of internetworking.)

If I may add a general observation about code set issues, particularly
multibyte encodings:  It seems to me that the people designing software
facilities, hardware, and standards concerning the issues generally fail
to appreciate a crucial design point:  The sooner you can map everything
into a uniform format with simple, clean, properties, the better off you
are.  Instead, we keep seeing designs that require the users of the
services to face algorithmic complexity, because the data being operated
upon has been left in a complex encoded form instead of being turned into
the previously mentioned uniform format with nice properties.  Algorithms
naturally reflect the underlying structure of the data.  If you'd like to
be able to code programs that deal with text in a simple manner, as seen
in early UNIX utilities such as "wc", you need to keep the form in which
text is seen by program code as simple as possible; for example, all text
characters must be handled as one "character" type, a complete unit of
which would be returned per call to getchar(), obviating the need for
wchar_t and the (rapidly growing) library of functions for helping
applications deal with nonunitized, fragmented, and stateful characters.
This would mean that in some environments a 16-bit datum would be
required for representing a single character, but we ended up with that
anyway, in the form of wchar_t, without the benefits of a simple program
interface to text units.  Since the character problems have to do with
people, their details should be pushed as far out from the application
(thus as close to the users) as possible.  I think X3J11, prompted by
certain vendors who were already committed to complicated solutions,
made the wrong choice here.  I would hope that other computer engineers
learn from this example how not to "solve" such problems.

Andreas.Kaiser@f7014.n244.z2.stgt.sub.org (Andreas Kaiser) (06/07/91)

 >Actually, one never NEEDS trigraphs; they're required to be 
 >supported as a convenience when interchanging source code among sites or 
 >equipment with poor support for the C source character set.

There are character sets that do not support all C language characters, such
as braces and brackets (example: 7-bit German). While it is usually possible to
use the characters corresponding to these ASCII codes instead (...but almost
unreadable), it is likewise possible that some text file exchange program
silently converts these special characters into PC, ISO or ECMA 8-bit
equivalents. Trigraph characters are available in all roman-style character sets
and will be understood by all machines.

While I never saw a machine-readable program actually using these trigraphs, I
saw printed demo programs in German books or student lessons, where the braces,
brackets, backslashes etc. were added by pencil, because German typewriters
usually do not support them.

While trigraphs in a C program are just as unreadable as the German umlauts,
they can be hidden by macros without accidentally undergoing invalidating
conversions.

                Gruß, Andreas

 * Origin: kaiser@ananke.stgt.sub.org - Stuttgart FRG (2:244/7014)

gwyn@smoke.brl.mil (Doug Gwyn) (06/10/91)

In article <676248139.0@ananke.stgt.sub.org> Andreas.Kaiser@f7014.n244.z2.stgt.sub.org (Andreas Kaiser) writes:
> >Actually, one never NEEDS trigraphs; they're required to be 
> >supported as a convenience when interchanging source code among sites or 
> >equipment with poor support for the C source character set.
>There are character sets, which do not support all C language characters, such 
>as braces and brackets (example: 7-bit german). While it is usually possible to
>use the characters corresponding to these ASCII codes instead (...but almost 
>unreadable), it is likewise possible that some text file exchange program 
>silently converts these special characters into PC, ISO or ECMA 8-bit 
>equivalents. Trigraph characters are available in all roman-style character sets
>and will be understood by all machines.

I wouldn't promise that codes for the "trigraph characters" are
universally available, for example in CDC "display code".  However,
they were deliberately chosen from the glyphs that are supposed to
have corresponding codes in all realizations of ISO 646 (rapidly
becoming obsolete).

You miss my point, though.  The "C source character set" is NOT,
repeat NOT, the same as whatever character code set is used for
external representation of text characters on a given system.
The C source characters are the result of a TRANSLATION from
external representation to some form internal to the compiler.
In many compiler implementations, this translation uses the same
encoding, but that is NOT a requirement of the standard, and the
freedom to translate one or more external characters to a C
source character can be exploited to nicely support the "difficult"
source characters that often have no "official" external equivalents.

In other words, constructs in Courier font in the C standard need
not be thought of as unmapped external source code representations,
but rather are (a) the internal form of the program after the first
part of translation phase 1, or (b) an external form after some
translation utility has been applied to the external source code to
present it decently on a printer or more than minimal terminal.
The printers and terminals I use, for example, all allow downloading
of font bitmaps, and so long as there are enough code values, which
there always are for 7-bit code sets, I can select those that will be
used to represent the glyphs for "vertical bar" etc.  If the compiler
supports the same conventions (as in fact most existing compilers do,
where the USASCII code values are adopted), everything works well.
Problems arise only when one's system has neglected to provide decent
facilities for input/editing/display of the "difficult" glyphs using
the adopted mapping conventions.

gwyn@smoke.brl.mil (Doug Gwyn) (06/11/91)

In article <676362343.52@egsgate.FidoNet.Org> Sean.Eric.Fagan@f98.n250.z1.FidoNet.Org (Sean Eric Fagan) writes:
>And the only "standard" way to *get* trigraphs is to use the 'c89' command.
>Since ANSI C *can't* specify how to get full conformance, having something
>like
>	filter_trigraphs file.c | pre_processor | check_syntax | compile
>is a perfectly "valid" way to get a conforming compiler.  Unless, as noted
>previously, you're on a POSIX-compliant system, in which case you will be
>able to use the aforementioned c89 command.

Each (standard) conforming implementation is FULLY conforming.
It is true that the vendors might make it possible for users to obtain
partially functional (non-standard) behavior if they do not invoke the
conforming implementation, but that's just implementation garbage, not
relevant to the discussion of standard conformance.

By the way, POSIX.2's "c89" should have been called "cc"...