[comp.lang.c] trigraphs in X3J11

rcd@ico.ISC.COM (Dick Dunn) (05/20/88)

I've talked to a few people about this, but I'd like to see if there's
more info floating around.

Background: "Trigraphs" in dpANS C are a way of avoiding the problems of
character-set restrictions, by introducing 3-character replacements for
those characters which are required for C but do not exist in the ISO 7-bit
set.  For example, if your character set doesn't have braces {}, you can
use ??< and ??> to denote them.  The behavior is as if trigraphs were
replaced by the corresponding single characters in a prepass to the compiler,
*including* replacement within strings.  All trigraphs begin with "??".
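
As a rough illustration of the replacement just described, here is a
minimal C sketch of such a prepass.  The names trigraph_char and
replace_trigraphs are made up for this example; nothing here is taken from
the draft standard's text or from any real compiler.  It models only the
trigraph step, not the later processing of backslash escapes.  String
literals that need two adjacent question marks are written with "\?" so
that the sketch is not itself altered by a trigraph-aware compiler.

```c
#include <stddef.h>
#include <string.h>

/* Return the character that two question marks followed by c stand for,
 * or 0 if c does not complete a trigraph. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing trigraphs in a single left-to-right scan.
 * The output never grows, so dst needs strlen(src) + 1 bytes.
 * Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        char r = 0;

        /* src[2] is read only when src[0] and src[1] are both '?',
         * so it is at worst the terminating NUL. */
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_char(src[2]);

        if (r != 0) {
            dst[n++] = r;       /* three input characters become one */
            src += 3;
        } else {
            dst[n++] = *src++;  /* ordinary character, copied as-is */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Because the scan is blind to context, text inside string literals is
rewritten exactly like any other text, which is the source of the problems
discussed below.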

The draft standard seems to be written in such a way that a compiler MUST
accept these trigraph sequences.  I'm perplexed on a couple of points here.

1.  Replacement within strings:  This is a change to the existing language.
    It breaks existing programs.  I looked through existing source code
    that we have here and found several programs which get broken or
    significantly altered.  Here's an example--sanitized, but typical of
    what can happen.  Suppose you now have:
	printf("bad status ??<%x>??--device %n\n", st, dev);
    What you're going to get, according to the draft standard, is something
    that has the effect of:
	printf("bad status {%x>~-device %n\n", st, dev);
    Point:  The sequence "??" is not at all rare.  Why was it chosen as the
    introducer?  (I think people who start getting messages about using
    `/dev/tty^ are going to be confused.)

    Note also that it is common practice to use "?" in initializing strings
    where the "?" positions will be replaced at execution time.  Pity the
    poor programmer who sets up something like:
	char	ta[] = "/tmp/d?????/a",   tb[] = "/tmp/d?????/b";
    and discovers (eventually) that these strings are each two characters
    shorter than they used to be; if he tries to replace the ?s, he'll
    write off the ends of the strings!

    NOW, before you light 'em up and blast me, YES, I realize it's a hard
    problem.  There aren't many safe character sequences to use--and YES, I
    know that you can't use backslash because that's one of the possibly-
    missing characters.  What I don't understand is why it was decided to
    introduce a brand-new (I assume) mechanism which breaks existing code.

2.  Replacement in program text:  My philosophical objections to
    replacement of trigraphs within a program are much less...but I wonder
    who might ever use them.  Is there any precedent for these sequences?
    Is there any reason to think they'll be used?  Let's take another
    (slightly contrived but realistic) example here--I'll construct a
    piece of code which says, roughly, "If the first character of `line'
    is a sharp or percent, call function prepro to handle the rest of the
    line, then increment linect".  We would now write this as:

	if (line[0]=='#' || line[0]=='%') {
		prepro(&line[1]);
		linect++;
	}

    Replacing all the nasty characters with corresponding trigraphs gives:

	if (line??(0??)=='??=' ??!??! line??(0??)=='%') ??<
		prepro(&line??(1??));
		linect++;
	??>

    I submit that this will produce code which is so near to unreadable
    that there is virtually no prospect of the mechanism ever seeing
    significant use.  If you believe that, you have to wonder why every
    standard compiler should have to carry the extra baggage.  If you don't
    believe that, I'd like to see some real evidence to show that
    programmers might use it.

A general question:  Has the trigraph mechanism been tried out, in real
practice, anywhere prior to the introduction in X3J11?  If so, I'd like to
hear about how it's worked out.
-- 
Dick Dunn      UUCP: {ncar,cbosgd,nbires}!ico!rcd       (303)449-2870
   ...Never attribute to malice what can be adequately explained by stupidity.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (05/20/88)

In article <5215@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
>The draft standard seems to be written in such a way that a compiler MUST
>accept these trigraph sequences.

Yes, a standard-conforming implementation MUST understand trigraphs.

>1.  Replacement within strings:  This is a change to the existing language.
>    It breaks existing programs.  ...
>    Point:  The sequence "??" is not at all rare.

Trigraphs ARE relatively rare in existing code.  Yours is the first
example I've seen, in fact.  Most applications think ? should be used
as a question mark in messages, perhaps ?? at the end of a few message
strings or in a chess program.

>    Why was it [??] chosen as the introducer?

Because all single characters in the ISO invariant code set already had
valid C meanings.  Many double-character sequences also already have
meanings.  ?? seemed to cause the least disruption to existing practice.

>    What I don't understand is why it was decided to
>    introduce a brand-new (I assume) mechanism which breaks existing code.

Because nobody, including you, has proposed anything that the Committee
agreed was better, and many C users (for example, Europeans) have a
perceived need that the parochial American outlook does not meet.

The point is that existing practice was deemed unsatisfactory, so
SOMEthing had to change.  X3J11 tried to minimize the impact of this
"quiet change".

>Has the trigraph mechanism been tried out, in real practice, anywhere
>prior to the introduction in X3J11?

This specific mechanism is an invention of X3J11, so far as I can
determine.  However, use of multi-byte sequences to encode things
that cannot be represented by a single byte is extremely common
practice.

Note, by the way, that I oppose trigraphs, but I can provide a definite
explanation of how the European needs can be met without them, just as
I can explain how the Japanese needs can be met without introducing the
wchar_t stuff.  My feeling is that people develop mindsets based on
previous non-optimal design that precludes their understanding what an
optimal design would be like.  Probably the difficulty of learning how
to deal with a kludge causes a psychological investment that is hard
to give up.

None of the above, of course, should be construed as official X3J11
information.

alan@Apple.COM (Alan Mimms) (05/21/88)

Perhaps the best solution to the trigraph dilemma is to make available some
public-domain filters for converting from- and to- the trigraph notation.
This would permit those unfortunate enough to have strange character sets
to write C code and to port that code to a machine whose C compiler does
NOT support the trigraph notation and back again with minimal pain.

I BELIEVE I understand that the trigraph notation is a simple transformation
of the normal ASCII-based C notation.  Consequently, it should be quite
simple to convert in both directions.  The only problem might be in strings
in programs which produce C programs as their output -- in which case, the
filters come to the rescue by converting the program's output before it is
compiled.
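
The to-trigraph direction of such a filter might look roughly like the
following C sketch.  The names trigraph_suffix and to_trigraphs are
invented for the example, and it assumes the input is ordinary ASCII C
source held in memory.  One caveat, stated as an assumption: if the input
already contains two question marks followed by one of the nine special
characters, converting to trigraphs and back will not reproduce that text
exactly.

```c
#include <stddef.h>

/* Return the third character of the trigraph that stands for c,
 * or 0 if c has no trigraph. */
static char trigraph_suffix(char c)
{
    switch (c) {
    case '#':  return '=';
    case '[':  return '(';
    case '\\': return '/';
    case ']':  return ')';
    case '^':  return '\'';
    case '{':  return '<';
    case '|':  return '!';
    case '}':  return '>';
    case '~':  return '-';
    default:   return 0;
    }
}

/* Copy src to dst, spelling each of the nine characters as a trigraph.
 * Every input character yields at most three output characters, so
 * dst needs 3 * strlen(src) + 1 bytes.  Returns the result length. */
size_t to_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    for (; *src != '\0'; src++) {
        char s = trigraph_suffix(*src);

        if (s != 0) {
            dst[n++] = '?';
            dst[n++] = '?';
            dst[n++] = s;
        } else {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
    return n;
}
```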

Doesn't this make most of the flamers happy?

-- 
Alan Mimms			My opinions are generally
Communications Products Group	pretty worthless, but
Apple Computer			they *are* my own...
...it's so simple that only a child can do it!  -- Tom Lehrer, "New Math"

chuck@eneevax.UUCP (Chuck Harris) (05/21/88)

In article <5215@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
>
>2.  Replacement in program text:  My philosophical objections to
>    replacement of trigraphs within a program are much less...but I wonder
>    who might ever use them.  Is there any precedent for these sequences?

Yes, back in the olden days, some implementations of APL had a "digraph"
character set that was composed of combinations of "$" and another char.
I <<had>> to use this set while I was at UoM, using Model 33's.  It was pretty
disgusting, but worked.
Our particular implementation was controlled by an option flag, so it didn't
harm native mode APL work.  A clear deficiency in the ANSI proposal.

>    Is there any reason to think they'll be used?  Let's take another

Not in my opinion.  the "digraph" set was simple enough, and APL's needs
easy enough to accommodate, that it didn't cause any real confusion.
(you ended up with something that looked a little like DEC's FOCAL)

APL needed an <- ($P), matrix divide ($#), "lamp" ($.), delta ($F) (or was
it $D), index ($I), ...  It's been so long, I forget most of it.

"C" uses a very rich set of characters, even when compared with APL.
Many of its most used characters are not representable in ISO (eg. {|\}[])

I LOVE IT!! 8-)

>
>	if (line??(0??)=='??=' ??!??! line??(0??)=='%') ??<
>		prepro(&line??(1??));
>		linect++;
>	??>
>
>    I submit that this will produce code which is so near to unreadable
>    that there is virtually no prospect of the mechanism ever seeing
>    significant use.  If you believe that, you have to wonder why every
>-- 

The last time I railed about Trigraphs, I caused quite a stir.  I gave
a few examples of the garbage that would result, likened the use of 
trigraphs to the techniques used to "enhance" the deficiencies of the
old Model 33 TTY, called the offending ISO terminals "Braindamaged",
ranted and raved about how simple it was for anybody who was stuck
with the ISO terminals to implement their own "trigraph" preprocessor
and leave the language intact.

For my efforts, I got called a "Chauvinistic American", a fool, and a few
other things that might have harmed my EGO.

So, outside of it being too late to change things, there is NO way that
I will risk posting anything on this subject. :-)

		Chuck Harris
		C.F. Harris - Consulting

beckenba@cit-vax.Caltech.Edu (Joe Beckenbach) (05/21/88)

---

	I'm not sure how nit-picky a detail this is, but the impression I've
gotten from the trigraph postings of late is that the compiler would rather
not deal with it.
	Isn't that what a preprocessor is for?
	(Or is the preprocessor considered part of the compiler?)

	For C code using trigraphs, I assume that judicious use of spacing
will ease matters, eg
	{int garbage[MAX];}
goes to 
	??<  int garbage??( MAX ??) ; ??>
much more legibly than a direct substitution without spacing
	??<int garbage??(MAX??);??>
Of course trigraphs mean more characters in a source file. But spaces and
\n's are cheap, or at least were until the compiler broke. :-)

	BTW, I had to program on an old IBM4381 workstation in Pascal. There
was no way to get the curly braces AT ALL from the keyboard; the language
support kludge was a trigraph sequence. It worked, but it was hard to spot
the comments for a while. The machine could display the curly braces, but
the machine couldn't generate them from any of the input devices! [Bad
design in action. :-( ]

-- 
Joe Beckenbach	beckenba@csvax.caltech.edu	Caltech 1-58, Pasadena CA 91125
Graduating in June, knowing that C ain't bad, tools exist and are useful, 
	and that digital watches could be a neat idea. :-)

gwyn@brl-smoke.ARPA (Doug Gwyn ) (05/21/88)

In article <10626@apple.Apple.Com> alan@apple.UUCP (Alan Mimms) writes:
>Perhaps the best solution to the trigraph dilemma is to make available some
>public-domain filters for converting from- and to- the trigraph notation.

Since the mapping is done during translation phase 1, this should be
feasible.  I would also suggest removing #pragma lines.

mcdonald@uxe.cso.uiuc.edu (05/21/88)

The solution to the trigraph botch is simple: have compiler vendors
make it optional. Provide an option in the "install" program for the
compiler to turn it on only if the user wants it. Otherwise, don't
do it. I don't see why it is necessary, anyway. The character set of
ANSI C is presumably the ANSI  character set ( Oh! can you say EBCDIC?
Not if you don't want to throw up on your keyboard! Besides, EBCDIC has
plenty of characters.) If someone wants to use a non-standard set, let them
find appropriate characters. I don't care if it is in the standard, so
long as it doesn't appear in my compiler.

Doug McDonald

nather@ut-sally.UUCP (Ed Nather) (05/23/88)

In article <10626@apple.Apple.Com>, alan@Apple.COM (Alan Mimms) writes:
> 
> Doesn't this make most of the flamers happy?
> 
By definition, *nothing* can make a flamer happy.  You may be able to
satisfy a few grumps or malcontents, but a true flamer yields to no
solution.  If one ever did, he would be exiled to Bitnet.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

jas@rain.rtech.UUCP (Jim Shankland) (05/23/88)

And then there's what Stallman has to say about trigraphs, in *Internals
of GNU CC*:

	You don't want to know about this brain-damage.


Jim Shankland
  ..!ihnp4!cpsc6a!\
               sun!rtech!jas
 ..!ucbvax!mtxinu!/

henry@utzoo.uucp (Henry Spencer) (05/23/88)

> Our particular implementation was controlled by an option flag, so it didn't
> harm native mode APL work.  A clear deficiency in the ANSI proposal.

I would expect that this is the way many C compilers will implement trigraphs;
I know of some that already take that approach.  Lots of people share your
view that trigraphs are ugly.
-- 
NASA is to spaceflight as            |  Henry Spencer @ U of Toronto Zoology
the Post Office is to mail.          | {ihnp4,decvax,uunet!mnetor}!utzoo!henry

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (05/23/88)

In article <7937@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
| In article <5215@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
| >The draft standard seems to be written in such a way that a compiler MUST
| >accept these trigraph sequences.
| 
| Yes, a standard-conforming implementation MUST understand trigraphs.

  This is the first case I've seen where the committee really blew it in
my opinion (yes I could live with noalias).  I completely agree with the
need to do this, but as currently specified it will cause a number of
problems. 

  A preprocessor function could specify the trigraph inducer, with the
default "none" to avoid breaking existing programs.  The committee seems
to have lost sight of that goal in this case.  The same functionality
could be provided by a new preprocessor function (can't break existing
programs). 

Consider:
	#trigraph ??

  Now your program can run on my machine, using the notation you used.
If I choose, I can run it through a filter and convert to full ASCII.
Better yet, I can take my existing programs and convert them before
sending them to you.

  Why do it this way? If I want to send you a program of mine, which I
wrote filled with the ?? sequence **like many machine control programs
which have to get certain ASCII characters to the device** I can give
you another sequence:
	#trigraph TX

  This is ugly as hell, but it will let you edit the program, and not
break it.

  PLEASE X3J11, fix this sucker! It CAN be done without breaking
existing programs.  It makes more sense in the preprocessor.  Best
reason is that as specified it will lead to compilers which don't do
full ANSI by default, or even subset compilers. 

  I scanned my local source directory and found three programs of 102
which would break. I don't know if that's typical, but why do it wrong
when it can be done another way. I did NOT scan the directory of
programs which do device control, since I have made that point and every
one would break and have to be handcoded with escape sequences, etc, to
get by this.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

karl@haddock.ISC.COM (Karl Heuer) (05/24/88)

In article <1988May23.000451.751@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>[It should be controlled by an option flag]
>I would expect that this is the way many C compilers will implement trigraphs;
>I know of some that already take that approach.  Lots of people share your
>view that trigraphs are ugly.

I countersuggest that the compiler should always recognize trigraphs, and
issue a warning message if any are encountered.  Then add an option to suppress
this warning.

This way, the compiler would still be conforming, and in the unlikely event
that some of my code uses a string containing two question marks followed by
one of the magic characters, I'd find out about it.
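
A check of that kind could be sketched in C along the following lines.
The names completes_trigraph and find_trigraphs are hypothetical, and a
real compiler would do this as part of its first translation phase rather
than as a separate scan over a buffer; this only shows what would be
reported.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if c is one of the nine characters that complete a trigraph. */
static int completes_trigraph(char c)
{
    return c != '\0' && strchr("=(/)'<!>-", c) != NULL;
}

/* Scan text for trigraphs.  The 1-based line number of each one found
 * is stored in lines[], up to max entries.  Returns the total number of
 * trigraphs seen, which may exceed max. */
size_t find_trigraphs(const char *text, unsigned long *lines, size_t max)
{
    size_t found = 0;
    unsigned long line = 1;

    while (*text != '\0') {
        if (text[0] == '?' && text[1] == '?' &&
            completes_trigraph(text[2])) {
            if (found < max)
                lines[found] = line;
            found++;
            text += 3;          /* skip the whole trigraph */
        } else {
            if (*text == '\n')
                line++;
            text++;
        }
    }
    return found;
}
```

A caller would print one warning per recorded line number unless the
suppressing option is given.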

(This is based on the assumption that trigraphs stay in.  I'd prefer that they
be removed, but in any case I don't expect them to get in my way.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

lenoil@Apple.COM (Robert Lenoil) (05/24/88)

In article <5215@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
>    Note also that it is common practice to use "?" in initializing strings
>    where the "?" positions will be replaced at execution time.

Dick is dead right here.  What is the justification for breaking existing
programs when the ability to include untypeable characters into strings already
exists via the \xxx mechanism?  Instead of introducing a totally new notion
(to C, anyway) of trigraphs, why not simply extend the backslash escape
mechanism to be valid outside of strings?  This would allow the use of #defines
to perform the same function as trigraphs:

#define ??< \173	/* open brace */
#define ??> \175	/* close brace */

By using the backslash escapes in strings and your favorite synonym outside of
strings, the same effect is reached without breaking any existing code.  If
people don't want to use the backslash escapes in strings, they can make use of
the new stringizing operators to get the #define'd constants into their
strings.

Robert Lenoil
Apple Computer, Inc.

jss@hector.UUCP (Jerry Schwarz) (05/24/88)

In article <10941@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>
>  PLEASE X3J11, fix this sucker! It CAN be done without breaking
>existing programs.  It makes more sense in the preprocessor.  Best
>reason is that as specified it will lead to compilers which don't do
>full ANSI by default, or even subset compilers. 
>

Attached below is the text of the official committee response  to
Letter P02 (during the first public review period around a year ago).

The standard is now in the third public review period and the
committee is only accepting comments on changes made between the
second and third drafts.  Thus Trigraphs (which were accepted very
early on by the committee and were discussed extensively in this
group several times since) are almost certainly going to be in the
final version.

I can understand that it may be frustrating for someone to come upon
the proposed standard today, see something they don't like, and feel
it is being rammed down their throats without due consideration.
Especially if they think they have a better way to solve the problem.
However, I hope such people will try to understand that the process
of creating a standard goes on for a long time and that suggestions
made toward the end of the process may not receive the same
consideration as suggestions made earlier.

For the record: I think trigraphs are a bad idea.

Jerry Schwarz

-----------------------------

X3J11 Response to Letter P02

Summary of Issue: Eliminate trigraphs.

Committee Response:

The Committee has reaffirmed this decision on more than one occasion.

The Committee discussed alternatives to trigraphs on a number of
occasions, but always decided that they fill a need.  C must support
a wide variety of terminals and keyboards many of which lack the full
C character set.

thorinn@diku.dk (Lars Henrik Mathiesen) (05/24/88)

As one who regularly uses a non-ASCII terminal setup, I'd better explain
a little. In Danish (my native language) we have three `extra' letters
which we much prefer to use when writing Danish text - it is possible to
get by with two-letter replacements, but it's not very readable. By the
way, these are not `accented letters;' they are separate letters of the
alphabet, with their own place at the end of the sorting sequence. Much
the same applies to German, Swedish, Norwegian, and many other European
languages.
  That's not usually a problem as most modern terminals have provisions
for various national character sets, which are defined in an ISO standard.
This standard allows the glyphs at some eight or ten positions to vary,
including @, $, [, \, ], {, | and }. The latter six are used for the non-
ASCII letters in Danish, as they follow the other letters nicely.

  So, the X3J11 people think, the poor Europeans can't use ASCII: we'll
have to invent some kludge to bring C to their benighted shores. The only
excuse for inventing something so horrible is that it only breaks a very
few programs, and that it won't be used anyway.
  You see, over here we get by just fine without trigraphs. The less
fortunate are stuck with a national character set, and have to put up
with seeing the various punctuation as letters - they are not as visually
distinctive (and the brackets and braces don't pair naturally), but with
a little attention to layout one gets by quite well. And it's _much_ better
than trigraphs.
  The lucky ones have terminals which can switch between ASCII and national
character sets. If not for the warped minds of the terminal manufacturers,
this would be the perfect solution. But we (at this institute) have yet to
see a terminal with an escape sequence to switch character sets, or (and
this is worse) one whose keyboard layout did _not_ change with the character
set shown on the screen. (And none of them had LCD keytops). So we have to
pay the importer to hack new PROMs to enable us to switch without moving the
keys around. But I digress.
  By the way, I find that it's easier to read Danish with ASCII characters
than it is to parse convoluted C code in Danish characters, so I hardly
ever bother to switch any more.

To make it pleasant to use C and national letters in the same file, there
would have to be _convenient_ replacements for the ASCII characters in
question, and it would have to allow the national letters to be used in
identifiers (trigraphs don't). This cannot be done as an extension of the
ASCII C input format because the national letters are punctuation in ASCII.
  Now we're talking about an alternate input format for C - we'll have to
tell the compiler if a given source file is in the `old' or the `new' format.
On the other hand this frees us to use extra keywords etc. The new format
shouldn't use any characters that may be replaced in national character sets.
  The tokens [ ] { } | || (and in some compilers |=) must be replaced; one
off-the-cuff possibility is (. .) beg end or cor (or=). We need a new
pre-processor escape and a new string escape, which can't very well be
keywords. // might be a possibility for both, as it's rare in C, but does
it look too much like JCL?
  This new format could probably be implemented by a little lex pre-pre-
processor; national characters in identifiers would have to be encoded
somehow (e.g. using Q as an escape), increasing the identifier length.
This would cause problems with symbolic debuggers and short-name compilers,
but could easily be retrofitted on old compilers (write your own cc ...).
Oh well, it wouldn't be portable anyway. Hey, anybody from GNU reading this?

By the way, Standard Pascal is designed to be possible to write without
specific ASCII characters: It allows (. .) for [ ] (indexing), and (* *)
for { } (comments). Since e.g. .5 is a legal constant, this may cause
unexpected parse errors for programmers who're unaware of the feature.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.

chris@mimsy.UUCP (Chris Torek) (05/25/88)

In article <10949@apple.Apple.Com> lenoil@Apple.COM (Robert Lenoil) writes:
>... why not simply extend the backslash escape mechanism to be valid
>outside of strings?

Backslash is one of the characters that cannot be represented in some
character sets (the trigraph ??/ is a synonym for it in the dpANS).

>This would allow the use of #defines to perform the same function
>as trigraphs:
>
>#define ??< \173	/* open brace */
>#define ??> \175	/* close brace */

This would be almost as big a change as trigraphs; the #define syntax
is now

	# define <identifier><arglist_opt> <replacement-text>

and `??' is not part of an <identifier>.

I think the `#trigraph' suggestion is a suitable way to keep trigraphs
from affecting old code and/or infesting new code.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

karl@haddock.ISC.COM (Karl Heuer) (05/25/88)

In article <10949@apple.Apple.Com> lenoil@apple.UUCP (Robert Lenoil) writes:
>Instead of introducing [trigraphs], why not simply extend the backslash
>escape mechanism to be valid outside of strings?

That's an easy one: backslash is one of the characters that may not exist!

>This would allow the use of #defines
>#define ??< \173	/* open brace */
>#define ??> \175	/* close brace */

Not unless you extend the preprocessor's notion of what constitutes a valid
macro name.  Note also that the magic constants \173 and \175 are unportable.

In article <10941@steinmetz.ge.com> davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:
>I scanned my local source directory and found three programs of 102 which
>would break. ... I did NOT scan the directory of programs which do device
>control, since I have made that point and every one would break and have to
>be handcoded with escape sequences, etc, to get by this.

Are you sure you have that many programs that would break?  Note that `??'
alone is not a problem; it becomes a trigraph only when followed by one of the
nine characters "=(/)'<!>-".  (Unlike backslash, which is reserved even if the
following character is unrecognized.)

Assuming trigraphs stay in, the fix is simple: filter your code through
	sed -e "s;??\\([-=(/)'<!>]\\);?\\\\?\\1;g"
as part of the ANSIfication process.  (Better yet, do it now before you run
into a compiler with trigraphs.  It won't hurt, unless your current compiler
complains about the unrecognized escape "\?".)
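
For anyone without sed handy, the same transformation can be sketched in
C.  The names is_trigraph_tail and escape_trigraphs are invented for this
example.  Like the sed command, it rewrites every match wherever it
appears, so it is only appropriate when the trigraph-like sequences occur
inside string literals or character constants, where "\?" is a valid
escape.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if c is one of the nine characters that complete a trigraph. */
static int is_trigraph_tail(char c)
{
    return c != '\0' && strchr("=(/)'<!>-", c) != NULL;
}

/* Copy src to dst, inserting a backslash between the two question marks
 * of anything that would be read as a trigraph.  Each match grows by one
 * character, so 2 * strlen(src) + 1 bytes is always enough for dst.
 * Returns the length of the result. */
size_t escape_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (src[0] == '?' && src[1] == '?' && is_trigraph_tail(src[2])) {
            dst[n++] = '?';
            dst[n++] = '\\';
            dst[n++] = '?';
            dst[n++] = src[2];
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```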

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

rcd@ico.ISC.COM (Dick Dunn) (05/25/88)

> > Our particular implementation was controlled by an option flag, so it didn't
> > harm native mode APL work.  A clear deficiency in the ANSI proposal.
> I would expect that this is the way many C compilers will implement trigraphs;
> I know of some that already take that approach.  Lots of people share your
> view that trigraphs are ugly.

Adding a compiler switch may be the "least bad" solution that a rational
compiler writer can find...but it gives programmers an ugly choice:  The
compiler either transmogrifies trigraphs or it compiles programs in a
nonstandard way.

If the programmer writes:
	printf("What on earth??!\n");
a standard-conforming compiler should produce code which will cause the
program to print:
	What on earth|
If instead it produces code which causes the program to print:
	What on earth??!
it's violating the standard.

If you're compiling your own code, you know when to turn on the trigraph
switch on the compiler...but if you're compiling jrandom.c that you got on
a tape from somebody, what do you do?  Is it a standard program?  Was it
written before the standard came out?  (There are a couple of files in the
netnews source which fall into just this hole.)
-- 
Dick Dunn      UUCP: {ncar,cbosgd,nbires}!ico!rcd       (303)449-2870
   ...If you get confused just listen to the music play...

ok@quintus.UUCP (Richard A. O'Keefe) (05/25/88)

In article <10949@apple.Apple.Com>, lenoil@Apple.COM (Robert Lenoil) writes:
> Instead of introducing a totally new notion
> (to C, anyway) of trigraphs, why not simply extend the backslash escape
> mechanism to be valid outside of strings?

Because backslash itself is one of the missing characters.
(This is all fixed in the ISO 8859 character set family anyway.)

rcd@ico.ISC.COM (Dick Dunn) (05/25/88)

In article <10949@apple.Apple.Com>, lenoil@Apple.COM (Robert Lenoil) writes:
> Dick is dead right here.  What is the justification for breaking existing
> programs when the ability to include untypeable characters into strings already
> exists via the \xxx mechanism?...

The problem is that backslash is one of the characters which does not exist
in the European character sets of concern!  You can't use backslash to
dodge the problem, because you don't have backslash!  (That is one of the
little nits that makes it such a nasty problem.)
-- 
Dick Dunn      UUCP: {ncar,cbosgd,nbires}!ico!rcd       (303)449-2870
   ...If you get confused just listen to the music play...

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (05/25/88)

I'm not asking for the removal of the feature, I'm pointing out that it
is currently done in a way which breaks existing programs, and that
there are ways to prevent that from happening.

I was on the committee for the first two years, and I can't find any
references to trigraphs in my old notes. Bill Plauger's original comment
on things like this (from my notes on the Washington meeting) was that
"we should not egregiously break existing programs." I think that the
current implementation is a major deviation from that philosophy,
justified only if there is no other way.

As for last minute things, the vendors wanted to add noalias at the last
minute to allow better code generation (I actually didn't object to
that) so changing the implementation of a feature which (a) no one is
currently using, and (b) breaks existing programs is certainly NOT an
impossibility.

Please remember the A in ANSI stands for American, as does the A in
ASCII. In an effort to make this a viable international standard, X3J11
may not have considered the impact of this implementation.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

rcd@ico.ISC.COM (Dick Dunn) (05/25/88)

Thanks to Doug Gwyn for some answers on trigraphs.  Unfortunately, the more
I learn, the less I like them...but that's not Doug's fault.  >>=me, >=Doug

> >1.  Replacement within strings:  This is a change to the existing language.
> >    It breaks existing programs.  ...
> >    Point:  The sequence "??" is not at all rare.
> Trigraphs ARE relatively rare in existing code.  Yours is the first
> example I've seen, in fact.  Most applications think ? should be used
> as a question mark in messages, perhaps ?? at the end of a few message
> strings or in a chess program.

Wait.  I said that ?? (the trigraph introducer, if you will) is not at all
rare, and this is easy to confirm.  Occurrences of ?? are important because
they represent situations where the next character could cause trouble.

Go look at source code!  If you're on a UNIX system, find some source and:
	find . -name '*.[ch]' -exec grep '??' '{}' ';'
I suggest that you look for all ?? instead of just trigraphs so that you
can get an appreciation of where ?? appears.

When I first found trigraphs, I said "WTF??!" and immediately looked at my
own source code.  I found one conflict.  So I went to a UNIX source tree
and found several occurrences in Sys V code.  More poking around turned up
scattered others--some netnews source, some networking stuff.  There
aren't a lot of them, but they *do* exist.

I would have expected the committee to do as I did--search large piles of
source code to look for conflicts.  It only took me a little while one
evening.  Some repeats--??! as an expletive; (???) for a questionable item.

The following is NOT meant as a flame against Doug (who has stuck his neck
out to explain some of what has gone on), but I think the committee reneged
on its responsibilities in putting trigraphs in.  From the X3J11 rationale:

| The X3J11 charter clearly mandates the Committee to _codify_existing_
| _practice_.  (emphasis present; "_" is italics)
|  ...
| Existing code is important.
|  ...
| Avoid "quiet changes."

Trigraphs are not existing practice; apparently they have not even been
really tried out!  They break existing code in a "quiet change" fashion.
There are real examples of code currently in use which will be "broken"
if recompiled by a compiler conforming to this part of the draft standard.

> >    What I don't understand is why it was decided to
> >    introduce a brand-new (I assume) mechanism which breaks existing code.
> Because nobody, including you, has proposed anything that the Committee
> agreed was better,...

I intentionally avoided any sort of counterproposal in the first posting
because I wanted to focus on what the committee had done and why; I didn't
want to start with a debate over anything I would propose.

I have a philosophical view that this problem would be better off with
no solution than with a clumsy solution that breaks existing code.  (I
don't agree that "a bad solution is better than none at all.")  There are
other areas where X3J11 said "there's no prior art" and/or deferred work on
a problem to extension work.

Trigraphs in strings are the important issue; trigraph symbols in code are
ugly but don't break anything.  So, just for the sake of argument I'll toss
out some ideas for strings:  There is already one form for an alternate
interpretation of the mapping of a literal character or string into its
memory representation, namely L"stuff" for wide chars and strings.  Why not
use the same model--say, precede the string with R for restricted or T for
trigraph; thus R"stuff??/n" would mean R"stuff\n".  Even if you think
L"stuff" is a mistake, this would only be a second occurrence of the same
class of mistake.  (Karl Heuer noted that L"stuff" is a quiet change too,
but it's highly unlikely to hit; I've found no occurrences.)

As I said, that was JUST a proposal for the sake of argument.  You might
equally well construct names for the problem characters and build them
into a header file; then construct strings by the compile-time concate-
nation business.  There are other ways.  YES, they're ugly, BUT they don't
have to break existing code, while the draft standard method is ugly AND
breaks code.

What about an ISO 8859 character set?  Wouldn't that cover a lot of the
problem area?

>...and many C users (for example, Europeans) have a
> perceived need that the parochial American outlook does not meet.

I understand their need.  I agree that it's "parochial" to ignore the
problem, but I don't think it's parochial to say "we don't have a good
solution yet, so let's not cast a bad one in concrete."  =>What do Europeans
do about C now?<=  Is there NO prior art?  If not, it's certainly not ready
to be standardized!

> >Has the trigraph mechanism been tried out, in real practice, anywhere
> >prior to the introduction in X3J11?
> This specific mechanism is an invention of X3J11, so far as I can
> determine.  However, use of multi-byte sequences to encode things
> that cannot be represented by a single byte is extremely common
> practice.

I know that multi-byte sequences are common--I worked with 370ish Pascal
quite a while back, and we had to use digraphs for about six characters.
These digraphs became part of the Pascal standard, BUT there's a big
difference: the digraphs were established practice long before the
standard was done.  They were in use, known to be practical (if ugly),
and didn't break anything on machines that didn't need them.

It is also clear that you don't get very far trying to invent believable
digraphs for C, so you need trigraphs if you go that route.  The objection
is that they haven't been tried out.  You're standardizing something you
haven't really used in practice, and since C is not Ada (oops; sorry:-),
that's just not wise.

> Note, by the way, that I oppose trigraphs, but I can provide a definite
> explanation of how the European needs can be met without them...

Then I wish folks had pushed against them harder.  (Maybe you did, Doug; I
don't know.)
-- 
Dick Dunn      UUCP: {ncar,cbosgd,nbires}!ico!rcd       (303)449-2870
   ...If you get confused just listen to the music play...

henry@utzoo.uucp (Henry Spencer) (05/26/88)

> I countersuggest that the compiler should always recognize trigraphs, and
> issue a warning message if any are encountered.  Then add an option to suppress
> this warning.

Actually, in the experimental scanner I'm playing with, they are always
recognized, but how they are interpreted depends on an option.  If the
trigraph option is on, they are interpreted as per X3J11.  If the option
is off -- the default -- a warning message is produced and each trigraph
is interpreted as three characters.

henry@utzoo.uucp (Henry Spencer) (05/26/88)

> Consider:
> 	#trigraph ??

You can't even write this without trigraphs, because # is one of the magic
characters that may not exist in the source character set.

I don't much like trigraphs, and I think there are more graceful approaches
(like saying "use ISO Latin 1", which eliminates the problem), but you can
be fairly sure that X3J11 has already thought of all the simplistic quick
fixes and turned them down for one reason or another.

henry@utzoo.uucp (Henry Spencer) (05/26/88)

> ... What is the justification for breaking existing programs when the
> ability to include untypeable characters into strings already
> exists via the \xxx mechanism?  Instead of introducing a totally new notion
> (to C, anyway) of trigraphs, why not simply extend the backslash escape...

Because, for openers, backslash is one of those ASCII-specific characters
that you can't even *write* without trigraphs in some of the European
character sets.  I do wish people who want to sound off about this problem
would first spend some time understanding it!

ok@quintus.UUCP (Richard A. O'Keefe) (05/26/88)

In article <5424@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
> Wait.  I said that ?? (the trigraph introducer, if you will) is not at all
> rare, and this is easy to confirm.  Occurrences of ?? are important because
> they represent situations where the next character could cause trouble.

I just checked a directory containing 126 utility sources (some of which I
got from the net, some of which I got from a friendly wizard years ago) and
4 of them contained ?? inside strings.  If I've understood the rules, only
two of them would actually break.  Rather alarming:  before I made this
check I was happy about trigraphs:  they won't break _my_ code, I said!

johnl@n3dmc.UUCP (John Limpert) (05/26/88)

I think a simple solution to this problem is possible.  Why not
have the compiler print a warning if it detects a trigraph?
This would reduce the chances of breaking a program when it was
recompiled with an ANSI C compiler.  If the warning was only
printed the first time a trigraph was encountered, it wouldn't
be too annoying.  Restricting the check to literal strings might
be worthwhile.  Many compilers print warnings about constructs
that are legal, but are often unintentional coding errors.

-- 
John A. Limpert
UUCP:	johnl@n3dmc.UUCP	uunet!n3dmc!johnl
PACKET:	n3dmc@n3dmc.ampr.org	n3dmc@wa3pxx

faustus@ic.Berkeley.EDU (Wayne A. Christopher) (05/26/88)

Nobody has said what the existing practice is with regard to European
character sets.  Do Europeans just use an ascii keyboard when they want
to use C?  Or do they use u-umlaut for backslash (or whatever it is)?
Trigraphs are so ugly I can't believe anybody actually uses them, or
will use them if they're part of C.

I think trigraphs are a trick of American terminal manufacturers who
want to fool Europeans into thinking they can use their terminals for
writing programs.

	Wayne

flaps@dgp.toronto.edu (Alan J Rosenthal) (05/26/88)

In article <5391@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
[ re a compiler switch for trigraphs ]
>If the programmer writes:
>	printf("What on earth??!\n");
>a standard-conforming compiler should produce code which will cause the
>program to print:
>	What on earth|
>If instead it produces code which causes the program to print:
>	What on earth??!
>it's violating the standard.

Just goes to show you that conforming compilers are not likely to be
the most useful compilers.  However, the ansi standard doesn't say that
the conforming compiler must be called `cc'.  Most implementors will
probably say that it's called `cc -trigraph' (with probably other
switches as well).  Or, to put it another way, I fully expect all
ansi-conforming compilers to come in two flavours:  a strictly
conforming one and a useful one.

>If you're compiling your own code, you know when to turn on the trigraph
>switch on the compiler...but if you're compiling jrandom.c that you got on
>a tape from somebody, what do you do?  Is it a standard program?  Was it
>written before the standard came out?  (There are a couple of files in the
>netnews source which fall into just this hole.)

Ahem, all C programs were written before the standard comes out.  The
standard has not yet come out.

Anyway, to answer your question, you simply compile it without the
-trigraph switch and also without the -nocomplaintrigraph switch.  In
other words, a useful compiler will not implement trigraphs but will
give a warning message when it encounters them.

One might still argue that you then have to decide whether to recompile
with the -trigraph switch or not.  However I maintain that this problem
exists whether or not there is a compiler switch to solve it.

ajr

--
- Any questions?
- Well, I thought I had some questions, but they turned out to be a trigraph.

meissner@xyzzy.UUCP (Michael Meissner) (05/26/88)

In article <11655@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
| 
| I think the `#trigraph' suggestion is a suitable way to keep trigraphs
| from affecting old code and/or infesting new code.

Unfortunately, the problem with #trigraph and others of its ilk is
that '#' is one of the characters replaced in European 7-bit character
sets.
-- 
Michael Meissner, Data General.

Uucp:	...!mcnc!rti!xyzzy!meissner
Arpa:	meissner@dg-rtp.DG.COM   (or) meissner%dg-rtp.DG.COM@relay.cs.net

gwyn@brl-smoke.ARPA (Doug Gwyn ) (05/26/88)

In article <10941@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>	#trigraph ??

This is a nice idea, assuming that one remains committed to having the
compiler deal with trigraphs at all, which in my opinion was never
necessary even for European C users.

There have been further ISO developments affecting character sets
since the invention of trigraphs, so it might be appropriate to
reexamine this invention to see whether or not it could be totally
removed.  However, at this stage of the approval process, the only
way I can imagine a substantive change to the trigraph specs would
be for a serious objection ("veto") to be raised at the ISO level.
X3J11 has indicated a desire for the next round of public review
to be the last, which it cannot be if substantive changes are made.

>  PLEASE X3J11, fix this sucker! It CAN be done without breaking
>existing programs.  It makes more sense in the preprocessor.  Best
>reason is that as specified it will lead to compilers which don't do
>full ANSI by default, or even subset compilers. 

Trigraph mapping is specified as being done in translation phase 1,
which precedes what is normally considered "preprocessing", but could
certainly be handled by separate preprocessors.  I think your proposal
could be fit into the translation-phase scheme adequately if it were
accepted by the committee.

I don't really think there will be any compilers that will fully
conform to all ANSI/ISO C specs, except for trigraph handling, as
the default case.  Much more likely is that there would be separate
PCC-like and ANSI-conforming compilers (perhaps controlled by a
command-line "switch").

gwyn@brl-smoke.ARPA (Doug Gwyn ) (05/27/88)

In article <5424@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
>Thanks to Doug Gwyn for some answers on trigraphs.  Unfortunately, the more
>I learn, the less I like them...but that's not Doug's fault.

Thanks for recognizing that I don't like them either and am just trying
to explain what I think X3J11's motivation/reasoning was.  Of course,
I'm not speaking officially for X3J11 here and may have gotten this
wrong (the original decision was made before I started attending meetings).

One thing to keep in mind is that almost everyone agrees that it is
important for the ANSI and ISO standards for C to be technically
identical.  Therefore X3J11 is dealing with internationalization
issues, even though this might seem unnecessary for ANSI purposes.

>| Avoid "quiet changes."

The proposed ANSI/ISO C standard introduces several "quiet changes",
as noted in the Rationale document.  Certainly one guideline was to
minimize these, but there were many guidelines and they conflicted
to some degree.  Therefore compromises had to be worked out; if it
makes you feel better, call these "optimal solutions to constrained
problems" instead of "compromises".

>There are real examples of code currently in use which will be "broken"
>if recompiled by a compiler conforming to this part of the draft standard.

Yes, that's true for all "quiet changes".

>I have a philosophical view that this problem would be better off with
>no solution than with a clumsy solution that breaks existing code.

I don't think "no solution" was considered acceptable to ISO at the
time.

>... R"stuff??/n" would mean "stuff\n".

This is not a bad idea, but as the proposed standard stands trigraphs are
mapped well before anything else is done to analyze the source code, so
the "??/" would not hang around long enough for this method to be applied.
If it weren't for the need to deal with {} etc. then trigraph mapping
could possibly be deferred, but the main use of trigraphs is for {} etc.
so the mapping cannot be deferred long enough.

>What about an ISO 8859 character set?  Wouldn't that cover a lot of the
>problem area?

It was considered inappropriate for the C standard to constrain the choice
of character set like that.  However, it recently was revised to promise
that '0' through '9' have ascending numerical representations, and of
course it does require that a large set of characters be representable,
so there is some precedent.  I doubt that enough vendors would support
such a requirement, though.

The ISO 646-1983 invariant code set was taken as the least common
denominator for representable character glyphs.  I think that was the
real mistake; glyphs are just silly marks on paper or displays, and we
aren't really interested in their shapes other than that all of the ones
we need for C be unique.  I don't much care if { sometimes looks like [[
or \(lb, so long as I have tools for dealing with it when I program.

>What do Europeans do about C now?

The only existing practice I had heard about was use of (< >) etc.,
details varying from place to place.  Perhaps some Europeans can
contribute more info here.

>Then I wish folks had pushed against them harder.

Two factors conspired here.  One is that many existing
environments don't offer much support for better source code
import/export/printing translation, which is how I think this
issue should be dealt with.  The other is that "ISO insisted on
this sort of solution", which may or may not be true but it
certainly makes it hard to deal with since X3J11 and the ISO C
people don't meet concurrently.

ado@elsie.UUCP (Arthur David Olson) (05/27/88)

> I think a simple solution to this problem is possible.  Why not
> have the compiler print a warning if it detects a trigraph?

No. . .the compiler should only print a warning if a particular file contains
both trigraphs *and* trigraphable characters.  This way folks who write
"pure trigraph" code won't get inundated with warnings.  The scheme is
effective since '#' is a trigraphable character.
-- 
	ado@ncifcrf.gov			ADO is a trademark of Ampex.

minow@thundr.dec.com (Martin Minow THUNDR::MINOW ML3-5/U26 223-9922) (05/27/88)

In a message to comp.lang.c, faustus@ic.Berkeley.EDU asks what Europeans
use to write C programs.

As you no doubt know, there are about a dozen code positions in "Ascii" that
are reserved for national use.  The C language uses most of these for
syntactic purposes.  X3J11 invented the "trigraph" notation to allow
C programming on European terminals without the (current) kludge of
interpreting, say, "upper-case A-umlaut" as "left square bracket".

The problem only occurs for terminals that are limited to a single
seven-bit ISO-646 based character set.  EBCDIC terminals, and terminals
that conform to the newer ISO 8859 (Latin-1) or that are compatible with
Dec's VT200 series can use a coherent 8-bit character set that permits
C programming in its current form without loss of national characters.
Central to this is operating system support for 8-bit characters.  Some
operating systems (and utilities) assume that the eighth bit is free
for "flagging" which causes problems.

Although ISO 8859 is the best base for future programming, it should be
noted that non-ISO workstations such as the IBM PC, the Atari ST and the
Macintosh support a mixture of national letters and the ISO invariant set.

The only problem, then, is caused by "old-style" terminals combined with
seven-bit limited operating environments.  At the time trigraphs were
proposed, these were fairly common.  They are much less common now,
and are quickly being replaced by ISO-compliant terminals and workstations.

Imagine if C were being standardized in, say, 1974, when there were very
few terminals that supported lower-case:  one could well imagine a kludge
to allow mixed case programming on monocase terminals.  One such kludge
was, in fact, provided in the Unix operating system.  It finds little,
if any, use today -- and you would have to search carefully to find
an upper-case only terminal.

Because of the speed of conversion to ISO-8859 (and similar 8-bit
environments), coupled with ambiguities in the definition of trigraphs,
I recommended in my comments to the standard that they be dropped.
The committee rejected my arguments, but I would hope they reconsider
before release of the standard.

Martin Minow
minow%thundr.dec@decwrl.dec.com

PS: there was some question of "American Chauvinism".  For the record,
I have a European university degree, and worked as a programmer in
Europe for ten years.

The above does not represent the position of Digital Equipment Corporation

henry@utzoo.uucp (Henry Spencer) (05/28/88)

> [warning of trigraphs]  Restricting the check to literal strings might
> be worthwhile...

Practical but a bit of a nuisance.  Ideally one would like to use the
same code for the checking and (if enabled) the actual interpretation
of trigraphs.  This is almost impossible to do if one wants to be
selective about issuing warnings, because trigraph interpretation is
defined to happen at a time when you don't even know whether you're
inside a comment or not, never mind what kind of token you're examining.
-- 
"For perfect safety... sit on a fence|  Henry Spencer @ U of Toronto Zoology
and watch the birds." --Wilbur Wright| {ihnp4,decvax,uunet!mnetor}!utzoo!henry

thorinn@diku.dk (Lars Henrik Mathiesen) (05/28/88)

In article <3655@pasteur.Berkeley.Edu> faustus@ic.Berkeley.EDU (Wayne A. Christopher) writes:
>Nobody has said what the existing practice is with regard to European
>character sets.
I posted an article the other day, but maybe it didn't get past mcvax.
I shall include it here.

>I think trigraphs are a trick of American terminal manufacturers who
>want to fool Europeans into thinking they can use their terminals for
>writing programs.
Think again: If we use American ASCII-only terminals on an operating system
and compiler designed for ASCII, as most of them are, there's no problem
in writing C code, only in getting our national characters in the output.
I think a similar confusion may be part of the reason why trigraphs are so
badly conceived.

My prior article follows; I apologize if it's been seen before, but I
haven't seen any signs that it has.

As one who regularly uses a non-ASCII terminal setup, I'd better explain
a little. In Danish (my native language) we have three `extra' letters
which we much prefer to use when writing Danish text - it is possible to
get by with two-letter replacements, but it's not very readable. By the
way, these are not `accented letters;' they are separate letters of the
alphabet, with their own place at the end of the sorting sequence. Much
the same applies to German, Swedish, Norwegian, and many other European
languages.
  That's not usually a problem as most modern terminals have provisions
for various national character sets, which are defined in an ISO standard.
This standard allows the glyphs at some eight or ten positions to vary,
including @, $, [, \, ], {, | and }. The latter six are used for the non-
ASCII letters in Danish, as they follow the other letters nicely.

  So, the X3J11 people think, the poor Europeans can't use ASCII: we'll
have to invent some kludge to bring C to their benighted shores. The only
excuse for inventing something so horrible is that it only breaks a very
few programs, and that it won't be used anyway.
  You see, over here we get by just fine without trigraphs. The less
fortunate are stuck with a national character set, and have to put up
with seeing the various punctuation as letters - they are not as visually
distinctive (and the brackets and braces don't pair naturally), but with
a little attention to layout one gets by quite well. And it's _much_ better
than trigraphs.
  The lucky ones have terminals which can switch between ASCII and national
character sets. If not for the warped minds of the terminal manufacturers,
this would be the perfect solution. But we (at this institute) have yet to
see a terminal with an escape sequence to switch character sets, or (and
this is worse) one whose keyboard layout did _not_ change with the character
set shown on the screen. (And none of them had LCD keytops). So we have to
pay the importer to hack new PROMs to enable us to switch without moving the
keys around. But I digress.
  By the way, I find that it's easier to read Danish with ASCII characters
than it is to parse convoluted C code in Danish characters, so I hardly
ever bother to switch any more.

To make it pleasant to use C and national letters in the same file, there
would have to be _convenient_ replacements for the ASCII characters in
question, and it would have to allow the national letters to be used in
identifiers (trigraphs don't). This cannot be done as an extension of the
ASCII C input format because the national letters are punctuation in ASCII.
  Now we're talking about an alternate input format for C - we'll have to
tell the compiler if a given source file is in the `old' or the `new' format.
On the other hand this frees us to use extra keywords etc. The new format
shouldn't use any characters that may be replaced in national character sets.
  The tokens [ ] { } | || (and in some compilers |=) must be replaced; one
off-the-cuff possibility is (. .) beg end or cor (or=). We need a new
pre-processor escape and a new string escape, which can't very well be
keywords. // might be a possibility for both, as it's rare in C, but does
it look too much like JCL?
  This new format could probably be implemented by a little lex pre-pre-
processor; national characters in identifiers would have to be encoded
somehow (e.g. using Q as an escape), increasing the identifier length.
This would cause problems with symbolic debuggers and short-name compilers,
but could easily be retrofitted on old compilers (write your own cc ...).
Oh well, it wouldn't be portable anyway. Hey, anybody from GNU reading this?

By the way, Standard Pascal is designed to be possible to write without
specific ASCII characters: It allows (. .) for [ ] (indexing), and (* *)
for { } (comments). Since e.g. .5 is a legal constant, this may cause
unexpected parse errors for programmers who're unaware of the feature.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.

nather@ut-sally.UUCP (Ed Nather) (05/29/88)

In article <8805271311.AA12359@decwrl.dec.com>, minow@thundr.dec.com (Martin Minow THUNDR::MINOW ML3-5/U26 223-9922) writes:

[much clearly stated wisdom omitted]
> 
> Because of the speed of conversion to ISO-8859 (and similar 8-bit
> environments), coupled with ambiguities in the definition of trigraphs,
> I recommended in my comments to the standard that they be dropped.
> The committee rejected my arguments, but I would hope they reconsider
> before release of the standard.
> 

So would I.  The many, many negative comments about trigraphs on the net,
some from Europeans who would be expected to "benefit" from this new,
ugly and totally untested idea, say it is not just bad, but very bad.

Why mess up a fine job (according to dmr) of standardizing by quietly
introducing something that is so ugly it will never be used?

Of course, compilers which comply with the new standard might be
advertized as "Not Including Trigraphs" to gain sales, in the same
way the ads say "Not Copy Protected."

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

bill@proxftl.UUCP (T. William Wells) (05/29/88)

In article <5215@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:

> [lots of stuff demonstrating how trigraphs break existing code].

>     Replacing all the nasty characters with corresponding trigraphs gives:
>
>       if (line??(0??)=='??=' ??!??! line??(0??)=='%') ??<
>               prepro(&line??(1??));
>               linect++;
>       ??>

Ugh.  How horrible.  However, I imagine that few programmers will
actually have to cope with this.  As you suggest, the effort of
using the trigraphs would not be well rewarded; however,
mechanical translation of programs without the trigraphs into
those with trigraphs would permit compilation of existing
programs (and those written offline) on a machine without the
characters.

> A general question:  Has the trigraph mechanism been tried out, in real
> practice, anywhere prior to the introduction in X3J11?  If so, I'd like to
> hear about how it's worked out.

I remember all too well writing APL on a machine that had two
kinds of terminals: those with the APL character set and those
without; digraphs were used for entry using the latter.  I also
remember the intense competition to get the terminals with the
APL set.  BUT, we did write an awful lot of code with the
digraphs.

bts@sas.UUCP (Brian T. Schellenberger) (05/30/88)

In article <10949@apple.Apple.Com> lenoil@apple.UUCP (Robert Lenoil) writes:
|In article <5215@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
|>    Note also that it is common practice to use "?" in initializing strings
|>    where the "?" positions will be replaced at execution time.
|
|Dick is dead right here.  What is the justification for breaking existing
|programs when the ability to include untypeable characters into strings already
|exists via the \xxx mechanism?  Instead of introducing a totally new notion
|(to C, anyway) of trigraphs, why not simply extend the backslash escape
|mechanism to be valid outside of strings?  This would allow the use of #defines
|to perform the same function as trigraphs:
|
|#define ??< \173	/* open brace */
|#define ??> \175	/* close brace */

No, you are both DEAD WRONG here.  This will break badly on the IBM, PR1ME,
and other non-ASCII machines.

You should *NEVER* assume anything (that the ANSI C standard doesn't guarantee)
about the character set in portable programs.  And if your program isn't
intended to be portable, ANSI is irrelevant anyway.
-- 
--Brian, the man from
  Babble-on.                |Brian T. Schellenberger| ...!mcnc!rti!sas!bts     |
                            |104 Willoughby Lane    |work: (919) 467-8000 x7783|
                            |Cary, NC   27513       |home: (919) 469-9389      |

aledm@cvaxa.sussex.ac.uk (Aled Morris) (05/30/88)

In article <10949@apple.Apple.Com>, lenoil@Apple.COM (Robert Lenoil) writes:
> Instead of introducing a totally new notion
> (to C, anyway) of trigraphs, why not simply extend the backslash escape
> mechanism to be valid outside of strings?

I strongly agree with this proposal.  Trigraphs introduce a totally new
feature into the language, which is going to take some getting used to.
I can see some bugs creeping into my strings (and I bet they won't be in
the strings that get used very often, so they won't be easy to spot!)

Just one minor problem---isn't the backslash character one of the glyphs
missing from the Invariant Code Set?  Ah well....

Aled Morris

Janet/Arpa: aledm@uk.ac.sussex.cvaxa   |   School of Cognitive Science
      uucp: ..!mcvax!ukc!cvaxa!aledm   |   University of Sussex
      talk: +44-(0)273-606755  e2372   |   Falmer, Brighton, England
  "I'm living in the future/I feel wonderful/I'm tipping over backwards...
I'm so ambitious/I'm looking back/I'm running a race/and your the book i read"

karl@haddock.ISC.COM (Karl Heuer) (06/01/88)

The story so far: X3J11/ISO says that trigraphs have to exist because some
important character sets don't include symbols like "#".

However, some external representation of this character has to exist anyway.
After all, I can do putc('#', outf) to a text stream and read it back in,
whereupon it must compare equal to '#'; hence there is already some mapping,
independent of trigraphs, between the source character set and the external
character set.  Why can't the translator use this mapping instead of
trigraphs?

Example: suppose I don't have '#' but I do have at least one character which
is not part of ISO 646 (say, '$').  When writing to a text stream, in addition
to possibly mucking around with newlines I convert '#' to the digraph '$='.  I
do the opposite conversion on input.  There is no '$' in the source character
set.  My compiler and text editor are both written in portable C, and neither
knows about this translation (only the stdio library does).  There's no need
for '$' to even be printable.

Rebuttal, anyone?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

boby@pyuxf.UUCP (robert yaeger) (06/01/88)

In article <3655@pasteur.Berkeley.Edu>, faustus@ic.Berkeley.EDU.UUCP writes:
> Nobody has said what the existing practice is with regard to European
> character sets.  Do Europeans just use an ascii keyboard when they want
> to use C?  Or do they use u-umlaut for backslash (or whatever it is)?
> Trigraphs are so ugly I can't believe anybody actually uses them, or
> will use them if they're part of C.
> 
> I think trigraphs are a trick of American terminal manufacturers who
> want to fool Europeans into thinking they can use their terminals for
> writing programs.

Well just to let you know, trigraphs are indeed needed in the good ol' USA.
Try writing MVS/c programs using a 3270! Fortunately, the only trigraphs
needed are the ??( and ??) ( ie., [ and ] ). 

The practice we've adopted is to code trigraphs only when declaring arrays.
All references to these arrays in the code use ptr arithmetic. 
This confines their ugliness to the declaration sections.

-- 

Bob Yaeger 
uucp : ...!inhp4!bellcore!pyuxf!boby 
phone: 1-201-699-5128

gwyn@brl-smoke.ARPA (Doug Gwyn ) (06/01/88)

In article <4314@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>However, some external representation of this character has to exist anyway.
>After all, I can do putc('#', outf) to a text stream and read it back in,
...
>Rebuttal, anyone?

How can it be rebutted?  It's exactly correct, and is why I think
trigraphs were unnecessary in the first place.

Note: the code might be written "putc('??=', outf)" but it's still
a distinct character represented in the proposed C standard by the
glyph "#".  Sites that want to import strictly conforming programs
have to be able to handle non-trigraph sources anyway.  Trigraphs
are a (poor) solution to the wrong problem.

minow@thundr.dec.com (Martin Minow THUNDR::MINOW ML3-5/U26 223-9922) (06/02/88)

Perhaps one of the trigraph experts could answer a simple question:
Suppose I've written a fully-compliant C compiler (that handles trigraphs)
that I sell to my friend in Visby, Sweden who needs trigraphs since his
language has national letters replacing the "[\]{|}" of US ASCII.  He writes
his first program as:

	??= include <stdio.h>
	main() ??<
		printf("H{lsningar fr}n Visby p} \land!??/n");
	??>

When he runs my compiler, how does it know that the character whose value
is decimal 92 is a national letter, and not a backslash that crept in?
Do I need command line arguments or a ??=pragma?  Are they permitted by
the standard?

Will all ??=include files be required to be distributed in their
trigraphed format?

Martin Minow
minow%thundr.dec@decwrl.dec.com

karl@haddock.ISC.COM (Karl Heuer) (06/03/88)

In article <343@pyuxf.UUCP> boby@pyuxf.UUCP (robert yaeger) writes:
>Well just to let you know, trigraphs are indeed needed in the good ol' USA.
>Try writing MVS/c programs using a 3270! Fortunately, the only trigraphs
>needed are the ??( and ??) ( ie., [ and ] ).

And what, pray tell, do you see on your terminal if you run the program
  #include <stdio.h>
  main() { printf("??(??)\n"); }

>The practice we've adopted is to code trigraphs only when declaring arrays.
>All references to these arrays in the code use ptr arithmetic.

I once wrote a program using a certain style because it happened to look
better on the one printer that was then available.  I soon regretted that
decision.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

karl@haddock.ISC.COM (Karl Heuer) (06/03/88)

In article <8806021259.AA21135@decwrl.dec.com> minow@thundr.dec.com (Martin Minow THUNDR::MINOW ML3-5/U26 223-9922) writes:
>[A Swedish user] writes his first program as:
>	??= include <stdio.h>
>	main() ??< printf("H{lsningar fr}n Visby p} \land!??/n"); ??>
>When he runs my compiler, how does it know that the character whose value
>is decimal 92 is a national letter, and not a backslash that crept in?
>Do I need command line arguments or a ??=pragma?  Are they permitted by
>the standard?

It's up to the implementation to specify the character set.  You could have
one translator which believes `\' is a backslash, and a different one which
believes it's a national letter.  You can select which of these two
implementations is to compile the program by using a command-line argument.

>Will all ??=include files be required to be distributed in their
>trigraphed format?

It isn't necessary; you could supply a different set of include files with the
two implementations.  (E.g. `cc -{' could mean `interpret {|}[\] as national
characters and use /usr/include/swedish/*.h', while `cc +{' means `interpret
them as punctuation and use /usr/include/ascii/*.h'.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

ok@quintus.UUCP (Richard A. O'Keefe) (06/03/88)

In article <343@pyuxf.UUCP>, boby@pyuxf.UUCP (robert yaeger) writes:
> Well just to let you know, trigraphs are indeed needed in the good ol' USA.
> Try writing MVS/c programs using a 3270! Fortunately, the only trigraphs
> needed are the ??( and ??) ( ie., [ and ] ). 

The irony of this is that the manufacturer's (IBM's) character set (EBCDIC)
*does* include codes for "[" and "]", it's just that a lot of their
equipment doesn't quite support their own character set.

The pre-ANSI method used in the SAS C compiler ("(|" for "[" and
"|)" for "]") strikes me as far more readable, and neither combination
is otherwise legal C.

gwyn@brl-smoke.UUCP (06/05/88)

In article <8805261740.AA00659@explorer.dgp.toronto.edu> flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
>Or, to put it another way, I fully expect all ansi-conforming compilers
>to come in two flavours:  a strictly conforming one and a useful one.

I've already demonstrated that trigraph mapping is virtually a non-problem,
since accidental trigraph sequences in existing code are quite rare.

As long as we're guessing about the future, what I expect to see on many
systems is a choice (perhaps via "switches", more usefully just as a
separate name for the compile command) between
	(a) backward-compatible C, probably with most of the newer
		non-conflicting Standard C features
and	(b) fully-conforming Standard C.

A vendor who tries to modify (b) to provide the vendor's notion of what
is "useful" will not be selling any compilers to me, since I will need
full (b) for my strictly conforming applications.  That is what having
a standard is all about.

The main reason for (a) on UNIX systems would be to support Reiser cpp
abuse, which many programmers have been guilty of.  Otherwise, Standard
C is pretty much upward compatible with old Random C.

karl@haddock.ISC.COM (Karl Heuer) (06/06/88)

In article <8805261740.AA00659@explorer.dgp.toronto.edu> flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
>In other words, a useful compiler will not implement trigraphs but will
>give a warning message when it encounters them.

If it issues warning messages, why should the compiler bother to implement the
old meaning?  I'd think that the set of users who have programs containing
accidental trigraphs *and* can look at a compiler warning without wanting to
make it go away (by fixing their programs) would be very small.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

boby@pyuxf.UUCP (robert yaeger) (06/06/88)

In article <4393@haddock.ISC.COM>, karl@haddock.UUCP writes:

> >Try writing MVS/c programs using a 3270! Fortunately, the only trigraphs
> >needed are the ??( and ??) ( ie., [ and ] ).
  
> And what, pray tell, do you see on your terminal if you run the program
>   #include <stdio.h>
>   main() { printf("??(??)\n"); }
> 
  The answer is 

::
 
(: is the 3270 representation for an unprintable character).
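
The reason is that trigraph replacement applies inside string literals too, so the string handed to printf holds the bracket characters themselves. A minimal sketch of that replacement step, assuming a hypothetical helper named replace_trigraphs (an illustration of the mapping, not how any particular compiler implements it):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third characters of the nine trigraphs, and the single characters they
 * stand for, in matching order.  The tables are kept separate so that this
 * source file contains no trigraph sequences itself. */
static const char tri_third[] = "=(/)'<!>-";
static const char tri_repl[]  = "#[\\]^{|}~";

/* Hypothetical helper: copy src to dst, replacing each trigraph by its
 * single character, scanning left to right.
 * dst must hold at least strlen(src)+1 bytes.  Returns the new length. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *hit = NULL;

        /* Two question marks followed by one of the nine third characters. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(tri_third, src[2]);
        if (hit != NULL) {
            dst[n++] = tri_repl[hit - tri_third];
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Applied to the contents of the string literal in the quoted program, this yields `[`, `]`, and a newline, which is what the program then sends to the 3270.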

> >The practice we've adopted is to code trigraphs only when declaring arrays.
> >All references to these arrays in the code use ptr arithmetic.
  
> I once wrote a program using a certain style because it happened to look
> better on the one printer that was then available.  I soon regretted that
> decision.
> 

I don't see the connection here.  If you decide to use trigraphs instead of
ptr arithmetic, then it won't matter what printer you use; the code will
always be ugly and hard to maintain.

BTW, there are other solutions:

  1. You can hard-code the EBCDIC codes, i.e. x'ad' and x'bd', but these show
     up as unprintables on the 3270 and are also hard to edit after they
     are embedded in the source. This is what was done before trigraphs.

  2. You can use an APL terminal, which does support these characters.
     As another posting has pointed out, these chars are in EBCDIC but are
     not supported on the 3270 terminal.

-- 

Bob Yaeger 
uucp : ...!inhp4!bellcore!pyuxf!boby 
phone: 1-201-699-5128