[mod.std.c] mod.std.c Digest Volume 4 : Issue 12

osd7@homxa.UUCP (Orlando Sotomayor-Diaz) (03/11/85)
From: Orlando Sotomayor-Diaz (The Moderator) <cbosgd!std-c>


mod.std.c Digest            Mon, 11 Mar 85       Volume 4 : Issue  12 

Today's Topics:
              Character escapes for deficient terminals
                      Long external identifiers
                 preprocessor issues by Larry Rosler
----------------------------------------------------------------------

Date: Sat, 9 Mar 85 23:55 EST
From: Paul Schauble <ucbvax!Schauble@MIT-MULTICS.ARPA>
Subject: Character escapes for deficient terminals
To: cbosgd!std-c@BERKELEY

In a recent posting, Doug Moen tabulated the differences between the VAX
PCC C compiler and the proposed standard. One of his listings bothers me
badly:


> A set of trigraphs have been defined to allow C programs to be entered
> on terminals that don't support the full ASCII character set [B.2.1]:
>           ??=       #
>           ??(       [
>           ??/       \
>           ??)       ]
>           ??'       ^
>           ??<       {
>           ??!       |
>           ??>       }
>           ??-       ~
> These trigraphs are even interpreted within string constants,
> so that "??=" is identical to "#".  [B.1.1.2]


Part of the botheration is that it would have broken the program I was
working on just before reading this. It used ??(programname) as the
prefix for all error messages.
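
To make the breakage concrete, here is a minimal sketch of the
replacement step applied to the text of a string.  The function name
replace_trigraphs is hypothetical; the mapping is the one in the table
quoted above.  The sketch simulates the substitution on a buffer rather
than relying on any particular compiler to perform it.

```c
#include <assert.h>
#include <string.h>

/* Copy src to dst, replacing each trigraph from the quoted table with
 * its single-character equivalent.  dst must have room for
 * strlen(src) + 1 bytes, since the result never grows. */
void replace_trigraphs(const char *src, char *dst)
{
    static const char third[]  = "=(/)'<!>-";   /* third character   */
    static const char single[] = "#[\\]^{|}~";  /* what it stands for */

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *hit = NULL;

        /* Two question marks followed by a listed character. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(third, src[2]);
        if (hit != NULL) {
            *dst++ = single[hit - third];
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Fed a message prefix of two question marks followed by "(myprog)", the
function yields "[myprog)".  In source text the existing escape \? can
be used to keep two question marks from pairing up.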

Specifically, I really don't want to see yet another special set of
characters. We already have one escape character defined within strings,
\, let's just use that. Instead of the trigraphs, why not \( and so on?
Everybody already knows that \ is special.

    Paul

[ Some terminals don't have the \ key either.  See the trigraph list
  above.  -- Mod -- ]

------------------------------

Date: Sun, 10 Mar 85 01:02 EST
From: Paul Schauble <ucbvax!Schauble@MIT-MULTICS.ARPA>
Subject: Long external identifiers
To: cbosgd!std-c@BERKELEY

I don't believe that a 6 character limit is even close to acceptable.
Other people have given more than adequate reasons.  I would like to
suggest a possible solution for those machines which haven't yet updated
their linker.


This note proposes an alternate method for dealing with the problem of
long (>6 character) external identifiers. It is entirely contained
within the C compiler and should work with any linker.

Note that this discussion assumes a linker model with one external
definition and multiple external references. If the linker accepts
multiple external definitions (the labeled common model), some of the
numbers will change, but the basic argument will not.

First, one extension has been proposed on Info-C that I would like to
strongly second.  That is to reactivate the "entry" keyword and use it
to allow the internal and external names of a global item to differ,
e.g.

          extern int date_and_time() entry "SYS$TIME";
          extern int memory_size entry "CSYS$MEMSIZ";

I like this, other languages have it, it's useful, and it would have
saved me having to write a number of assembler routines whose only
purpose was to change names.

Not having seen the proposed standard, I don't know what notation it
uses for the language. The intention here is that the clause

     identifier ENTRY "external name"

can appear following an identifier in any context where the identifier
would be visible outside of the current compilation. The identifier is
used within the current compilation, and the contents of the quoted
string are used as the external name. No syntax is specified for the quoted
string, except that normal \ processing is performed. Of course, if no
entry clause is given, the external name is the same as the internal.

This solves two problems. First, as above, it allows referencing
external names that contain oddball characters, such as $, !, and ), all
of which I have seen used.

Second, it allows finessing the long external problem:

        extern very_long_internal_function_name entry "flff"();

by separating the characteristics of internal and external identifiers.
The external restrictions apply only if the programmer chooses not to
use entry.

In summary, I believe that this feature offers enough that it should be
adopted on its own merits.

Now, go one step further and adopt these rules for choosing the external
name:

  - If the declaration contains an entry clause, that gives the external
    name.

  - Otherwise, if the internal name meets all of the requirements for
    external names, use the internal name.

  - Otherwise, generate an external name by hashing the internal name.

The hash function should be one that considers every character in the
internal name, rather than, e.g., taking only the first 3 and last 3.
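
The three rules can be sketched roughly as follows.  Everything named
here is an assumption made for illustration: the function
choose_external_name, the 6-character limit, the 40-character alphabet,
and the use of the FNV-1a hash (chosen only because it looks at every
character).  "Meets all of the requirements" is reduced to a length
check; a real linker would impose its own rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINKER_NAME_MAX 6   /* assumed linker limit */

/* Assumed 40-character linker alphabet. */
static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz0123456789_$.%";

/* Hash every character of name (32-bit FNV-1a) into a 6-character name. */
static void hash_to_linker_name(const char *name,
                                char out[LINKER_NAME_MAX + 1])
{
    uint32_t h = 2166136261u;
    int i;

    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    h %= 4096000000u;               /* 40^6 possible linker names */
    for (i = LINKER_NAME_MAX - 1; i >= 0; i--) {
        out[i] = ALPHABET[h % 40u];
        h /= 40u;
    }
    out[LINKER_NAME_MAX] = '\0';
}

/* Returns the external name for internal.  entry is the string from an
 * entry clause, or NULL if there was none.  scratch receives the hashed
 * name when rule 3 applies. */
const char *choose_external_name(const char *internal, const char *entry,
                                 char scratch[LINKER_NAME_MAX + 1])
{
    assert(internal != NULL && scratch != NULL);
    if (entry != NULL)                          /* rule 1: entry clause */
        return entry;
    if (strlen(internal) <= LINKER_NAME_MAX)    /* rule 2: name fits    */
        return internal;
    hash_to_linker_name(internal, scratch);     /* rule 3: hash it      */
    return scratch;
}
```

Because the hash depends only on the internal name, every module that
refers to the same long name arrives at the same linker name without
consulting any other module.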

This hash function will map long identifiers onto whatever name
space the linker provides.  This mapping will be a many to one mapping,
but if the linker name space is at all reasonable and the number of
externals reasonably few, collisions would be rare.  Have the compiler
hash each external and use the hashed value as the linker name.
Collisions between two names within a module can be found by the
compiler and fixed by rehashing the name.  Collisions between names in
two different modules will not be caught by the compiler.  These will
result in duplicate definitions of the external, and should be caught by
the linker.  The only undetected error is to have an external reference
in a module set that is missing the external definition (already an
error) such that the reference hashes to the same name as another
definition.
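
A rough sketch of the within-module check follows.  The names
module_names and assign_linker_name, the salt used for rehashing, and
the table size are all hypothetical.  Note the assumption buried in it:
a rehashed name is only useful if every module that refers to the item
arrives at the same name, which this sketch does not attempt to solve.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 6          /* assumed linker name length    */
#define MAX_EXTERNALS 256   /* arbitrary cap for this sketch */

struct module_names {
    size_t count;
    char linker[MAX_EXTERNALS][NAME_LEN + 1];
};

/* 32-bit FNV-1a over the whole name, perturbed by salt. */
static void hash_name(const char *name, uint32_t salt,
                      char out[NAME_LEN + 1])
{
    static const char alphabet[] =
        "abcdefghijklmnopqrstuvwxyz0123456789_$.%";
    uint32_t h = (2166136261u ^ salt) * 16777619u;
    int i;

    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    h %= 4096000000u;               /* 40^6 possible linker names */
    for (i = NAME_LEN - 1; i >= 0; i--) {
        out[i] = alphabet[h % 40u];
        h /= 40u;
    }
    out[NAME_LEN] = '\0';
}

/* Nonzero if candidate is already assigned in this module. */
static int in_use(const struct module_names *m, const char *candidate)
{
    size_t i;

    for (i = 0; i < m->count; i++)
        if (strcmp(m->linker[i], candidate) == 0)
            return 1;
    return 0;
}

/* Assign a linker name, rehashing with a new salt on each collision. */
const char *assign_linker_name(struct module_names *m, const char *internal)
{
    uint32_t salt = 0;
    char *slot;

    assert(m->count < MAX_EXTERNALS);
    slot = m->linker[m->count];
    do {
        hash_name(internal, salt++, slot);
    } while (in_use(m, slot));
    m->count++;
    return slot;
}
```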

I assume that the compiler can be made to print a cross-reference list
from C language names to linker names and vice versa, so that the
programmer can read linker maps.  It's extra work, but that's what you
get for using an ancient linker. It's less work than trying to remember
what "dtmfdu" means everywhere in your C code.

Note that this technique has been used successfully by several other
compilers.

This way you won't tie the entire future of the C language to those
machines left over from the 60's.  Please don't do that.  This solution
allows those with good linkers to use them, those with poor linkers to
limp along, and provides encouragement to improve the linker breed.


A little analysis:
     The most primitive linker I can immediately think of coded the
names into a 32 bit word, using 6 characters from an alphabet of 40.
(Isn't this what the PDP11 does, rather than radix 50??)  This provides
a name space of roughly 4.3e9 linker names.  Given a good hashing
function, a module set containing 100 externals has about 1.2 chances
per million such SETS of having a DETECTED collision.  Having an undetected
collision, as noted above, first requires a programming error.
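
The per-set figure can be checked with the usual pairwise (birthday)
approximation: the number of pairs of externals divided by the size of
the name space.  This is a sketch that assumes a uniform hash; the
function name collision_probability is made up for the purpose.

```c
#include <assert.h>

/* Approximate chance that at least two of `externals` names share a
 * linker name, assuming each hashes uniformly into `name_space` slots.
 * Valid only while the result is much less than 1. */
double collision_probability(double externals, double name_space)
{
    assert(name_space > 0.0);
    return externals * (externals - 1.0) / 2.0 / name_space;
}
```

With 100 externals there are 4950 pairs, so
collision_probability(100, 4.3e9) comes to roughly 1.15e-6, in line with
the figure above.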

Assume that the useful lifetime of such machines is twenty years beyond
the publication of the standard (seems like plenty).  Assume that new
and different module sets as described above are developed for machines
as described above using standard compilers at the rate of one such set
per calendar day.  Over twenty years, that's 7300 module sets.  We can
expect about 0.0084 detected collisions among all of them.  Assume that
it takes 100 programmer-hours at $50/hour to repair each collision.
That's $5,000/collision, or an expected $42 over the twenty years.

That's the expected cost to the programming community of using this
scheme.

I am ignoring the cost of adding the hashing to the compilers since the
number of C compilers << number of C programmers.

Now, the benefit of longer names is more readable source.  This
shows up in faster program development and faster and easier
maintenance.  Let's say that having long externals that are the same
as the internal names saves 1% of programmer time (low, I think).

Conclusion:  At a 1% saving, the $42 expected cost is recovered once
$4200 has been spent on programming.  Seems to me that the standard
should mandate long externals if it can be shown that the universe
covered by the standard will spend more than $4200 on C program
development and maintenance over the next 20 years.
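
For anyone who wants to try other assumptions, the arithmetic behind
the $42 and $4200 figures can be sketched as below.  The struct and
function names are invented for this sketch; the inputs are the ones
assumed in the text.

```c
#include <assert.h>

struct cost_model {
    double module_sets;       /* sets developed over the period   */
    double p_collision;       /* detected-collision chance per set */
    double hours_per_fix;     /* programmer-hours per collision    */
    double dollars_per_hour;
    double savings_fraction;  /* share of programmer time saved    */
};

/* Expected repair cost over the whole period. */
double expected_cost(const struct cost_model *m)
{
    return m->module_sets * m->p_collision
         * m->hours_per_fix * m->dollars_per_hour;
}

/* Total programming spend at which the savings equal the cost. */
double break_even_spend(const struct cost_model *m)
{
    assert(m->savings_fraction > 0.0);
    return expected_cost(m) / m->savings_fraction;
}
```

With 7300 sets, a chance of 1.15e-6 per set, 100 hours at $50, and a 1%
saving, expected_cost gives about $42 and break_even_spend about $4200.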


----------------

Perhaps I have gone off the deep end here.  I think not.  Of course,
I welcome any logical rebuttals or comments.

          Paul Schauble
          Schauble@MIT-Multics

P.S.
    One drawback of short externals that I haven't heard pointed out:
If external names have the same specs as internal ones, then it is
possible to change an item to an external just by changing the
declaration.  This seems to happen fairly often when redesigning an
existing system and wanting to save code.  You can't do this if the
externals have to be shorter than internals.

Also, I would like to second the viewpoint that whatever length is
allowed, the entire name should be significant.  Allowing longer names
but ignoring the extra characters is, in my experience, very dangerous
and the source of some very subtle and nasty bugs.

------------------------------

Date: Sun, 10 Mar 85 13:38:45 pst
From: ucbvax!ucsfcgl!arnold (Ken Arnold)
Subject: preprocessor issues by Larry Rosler
To: @ucbvax.BERKELEY:cbosgd!std-c

>The scanning of strings for embedded identifiers was something the
>committee simply could not accept.  A string is a token, and what
>is inside a string has no grammar from which an identifier can be
>derived in any definable way.  Once again the committee had to be
>convinced that substituting macro arguments in strings was worthwhile,
>and then an acceptable method had to be INVENTED.

A practical comment:

The statement that an identifier can't be found in a string in any
definable way is simply false.  I can write a regular expression which
will define the identifiers inside (or outside) a string.  A regular
expression is quite specific and well defined.  It is a very clear,
defined method of explication.  Thus, finding identifiers in strings,
or defining what this means, is not a problem.

Looking for them is not a problem either.  The preprocessor part of the
compiler (embedded or not) knows when it is expanding a macro and when
it is not.  When it is expanding a macro with parameters, it can look
in a string for identifiers defined as above.  This is neither difficult
nor theoretically unclear.  I fail to see the committee's problem with
adding this feature which exists in many implementations.
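
As a rough illustration, here is a sketch of substituting one macro
parameter inside the body of a string.  An identifier is taken to be a
maximal run matching [A-Za-z_][A-Za-z0-9_]*.  The function name
substitute_in_string and the choice to copy backslash escapes untouched
(so that a parameter named n does not eat "\n") are assumptions of this
sketch, not a description of any existing preprocessor.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Copy body to out, replacing each whole identifier equal to param
 * with arg.  Returns 0 on success, -1 if out is too small. */
int substitute_in_string(const char *body, const char *param,
                         const char *arg, char *out, size_t outsz)
{
    size_t plen = strlen(param), alen = strlen(arg), o = 0;

    assert(outsz > 0);
    while (*body != '\0') {
        const char *piece = body;
        size_t len;

        if (*body == '\\' && body[1] != '\0') {
            body += 2;                  /* escape: copy both characters */
            len = 2;
        } else if (is_word_char(*body)) {
            while (is_word_char(*body))
                body++;
            len = (size_t)(body - piece);
            /* A run starting with a digit is a number, not a name. */
            if (!isdigit((unsigned char)*piece) &&
                len == plen && memcmp(piece, param, plen) == 0) {
                piece = arg;
                len = alen;
            }
        } else {
            body++;                     /* punctuation, blanks, etc. */
            len = 1;
        }
        if (o + len >= outsz)
            return -1;                  /* no room for text plus NUL */
        memcpy(out + o, piece, len);
        o += len;
    }
    out[o] = '\0';
    return 0;
}
```

Given the body x = %d\n, parameter x, and argument count, the result is
count = %d\n, while a body such as max is left alone because x is not a
whole identifier there.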

I agree with the comments about comments and white space.  The C
standard's extension is not evil, and I will admit that it is somewhat
more powerful, but the two methods are not mutually exclusive.



		Ken Arnold
=================================================================
Of COURSE we can implement your algorithm.  We've got this Turing
machine emulator...

------------------------------

End of mod.std.c Digest - Mon, 11 Mar 85 12:08:32 EST
******************************
USENET -> posting only through cbosgd!std-c.
ARPA -> ... through cbosgd!std-c@BERKELEY.ARPA (NOT to INFO-C)
In all cases, you may also reply to the author(s) above.