[net.lang.c] Length of external names

jeff@alberta.UUCP (C. J. Sampson) (12/31/84)

> >  and object file format don't accept long or case-sensitive names.  What
> >  does the proposed ANSI standard say about the issue?
> 
> The current draft says that the length limit (if any) and treatment
> of case in external identifiers are "implementation-defined", which
> means that implementors can do things as they wish but must document
> their decisions.  Also, the length limit may not be shorter than 6.

Gads!  When are they going to figure out that 6 or 8 characters is *not*
enough?  I spent three hours porting ogre to an Altos 586 running some
ancient version of Xenix, and most of that was spent changing function
names because I had only 7 significant characters.  I think that the standard
should enforce a minimum of 32 characters.  That would make programs more
portable and readable.
-------------------------------------------------------------------
C. J. Sampson			Snail Canada: #712 11135-83rd ave.
ihnp4!     \				Edmonton, Alberta
ubc-vision! |- alberta!jeff		CANADA  T6G 2C8
sask!      /			Phone: (403) 439-6851

david@ukma.UUCP (David Herron, NPR Lover) (01/02/85)

> From: jeff@alberta.UUCP (C. J. Sampson)
> Newsgroups: net.lang.c
> Subject: Re: length of external names
> Message-ID: <380@alberta.UUCP>
> Date: Sun, 30-Dec-84 21:38:19 EST
> 
> > The current draft says that the length limit (if any) and treatment
> > of case in external identifiers are "implementation-defined", which
> > means that implementors can do things as they wish but must document
> > their decisions.  Also, the length limit may not be shorter than 6.
> 
> Gads!  When are they going to figure out that 6 or 8 characters is *not*
> enough?  I spent three hours porting ogre to an Altos 586 running some
> ancient version of Xenix, and most of that was spent changing function
> names because I had only 7 significant characters.  I think that the standard
> should enforce a minimum of 32 characters.  That would make programs more
> portable and readable.

But if we enforce a minimum size then they will be portable only
within the systems that support that size.  I think "implementation-defined"
is the way to go.  At least for now.

--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:-
David Herron;  ARPA-> "ukma!david"@ANL-MCS
(Try the arpa address w/ and w/o the quotes, I have had much trouble with both.)

UUCP          -:--:--:--:--:--:--:--:--:-          (follow one of these routes)

{ucbvax,unmvax,boulder,research} ! {anlams,anl-mcs} -----\  vvvvvvvvvvv
							  >-!ukma!david
   {cbosgd!hasmed,mcvax!qtlon,vax135,mddc} ! qusavx -----/  ^^^^^^^^^^^

henry@utzoo.UUCP (Henry Spencer) (01/02/85)

> > The current draft says that the length limit (if any) and treatment
> > of case in external identifiers are "implementation-defined", which
> > means that implementors can do things as they wish but must document
> > their decisions.  Also, the length limit may not be shorter than 6.
> 
> Gads!  When are they going to figure out that 6 or 8 characters is *not*
> enough?  I spent three hours porting ogre to an Altos 586 running some
> ancient version of Xenix, and most of that was spent changing function
> names because I had only 7 significant characters.  I think that the standard
> should enforce a minimum of 32 characters.  That would make programs more
> portable and readable.

Oh lord, not this again...  This topic was discussed *to death* a few
months ago.  To summarize the major points that emerged:

- There are many systems which are doomed to live with old, brain-damaged
	linker formats.  Manufacturers have too big a commitment to the
	old formats to change, and their users have no say in the matter.
	It is politically vital for the acceptance of the standard that
	standard-conforming implementations be possible on such machines.
	This is regrettable but impossible to avoid.

- Trying to pick a number other than 6 is silly.  People who have a choice
	about the number can just as easily opt for no limit at all, which
	is clearly the right decision.  People who do not have a choice
	about the number generally are stuck with a rather low number,
	typically 6.

- Software which relies on long names is not fully portable, regardless of
	claims to the contrary.

- It is generally agreed that the situation is unsatisfactory and painful.

- I repeat a challenge I made at the time:  if you think a mandatory bigger
	number is appropriate despite the problems this will cause for the
	more backward systems, prove your point by convincing DEC or IBM
	to agree with you.
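
The portability hazard described in the points above can be checked
mechanically.  The following is a minimal sketch, assuming a linker that
keeps only the first 6 characters of an external name and ignores case;
the function name ext_names_clash and the macro EXT_SIGNIFICANT are
hypothetical, not part of any draft or toolchain.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed limit: the draft's minimum of 6 significant characters. */
#define EXT_SIGNIFICANT 6

/* Returns 1 if a linker that keeps only the first EXT_SIGNIFICANT
 * characters and ignores case would treat a and b as the same symbol,
 * and 0 otherwise. */
int ext_names_clash(const char *a, const char *b)
{
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < EXT_SIGNIFICANT; i++) {
        int ca = toupper((unsigned char)a[i]);
        int cb = toupper((unsigned char)b[i]);

        if (ca != cb)
            return 0;   /* names differ inside the significant prefix */
        if (ca == '\0')
            return 1;   /* both names ended: identical after case folding */
    }
    return 1;           /* the whole significant prefix matched */
}
```

Under these assumptions print_header and print_footer clash (both reduce
to PRINT_), while getc and getchar remain distinct.
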
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jeff@alberta.UUCP (C. J. Sampson) (01/03/85)

> > [ Names are not long enough, etc. etc. ]
>
> But if we enforce a minimum size then they will be portable only
> within the systems that support that size.  I think "implementation-defined"
> is the way to go.  At least for now.

The idea is that all standard systems will support a minimum size that is
reasonably large.  "implementation-defined" sizes will make C programs no more
portable than they are now in that respect.  What is the point of a standard
if it does not make programs written to the standard more portable?  I still
say that we should have minimum 32 character externs.  Porting ~500 lines an
hour just because of this is very expensive as well as very stupid.
=====================================================================
	Curt Sampson		ihnp4!alberta!jeff
---------------------------------------------------------------------
"It looked like something resembling white marble, which was probably
 what it was: something resembling white marble."

cottrell@nbs-vms.ARPA (01/04/85)

/*
> Gads! When are they going to figure 6 or 8 chars is not enuf?

i hate long names. what is this anyway, cobol? why say
'social_security_number' when you can say 'ssn'? personally, i would like
to see variable names restricted to 3 chars exactly :-)

*/

henry@utzoo.UUCP (Henry Spencer) (01/04/85)

> The idea is that all standard systems will support a minimum size that is 
> reasonably large.

If you can convince people like IBM and DEC to go along with this and
change their object-module formats to match, the entire C community will
be forever indebted to you.  That is *not* a facetious comment; we cannot
afford to ignore the major manufacturers when we are talking about making
something really standard.  If they don't accept it, then we have a two-
level standard in practice even if it's not so in theory.  And I see no
chance whatsoever that they are going to change their object-module formats
at this late date.  None, zero.  Give it up, it's hopeless.

And just how much acceptance would you expect for a standard that none of
the major manufacturers can comply with?

> "implementation-defined" sizes will make C programs no more
> portable than they are now in that respect.  What is the point of a standard
> if it does not make programs written to the standard more portable?

I hate to tell you this, but the current drafts have a substantial appendix
listing all the "implementation-defined" characteristics of a conforming
implementation.  Identifier length is only one of a longish list.

Even if the standard does not make programs more portable -- and it will,
it will -- it prevents future compiler writers from making them still
*less* portable.  "The difference between bad and worse is much sharper
than the difference between good and better."

> I still
> say that we should have minimum 32 character externs.  Porting ~500 lines an
> hour just because of this is very expensive as well as very stupid.

Imagine how some of us PDP11 people feel when we try to port 4BSD programs
to our machines; the stupidities are not limited to long names.  ("Malloc
never fails, so we needn't bother checking its return value.")  ("Of course
ints are 32 bits, everywhere.  The whole world's a VAX, after all.")  ("I
know I'm really supposed to use %ld to print a long, but who cares?  It's
the same size as an int, so I'll just use %d.")  The only cure for this
sort of malignant imbecility is more care by the original author.  Porting
unportably-written software is always going to be hard.
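
For contrast with the habits quoted above, here is a minimal sketch of
the careful versions: the result of malloc is checked, and a long is
printed with %ld.  The helper names make_counters and format_long are
hypothetical, and snprintf is a later (C99) library function used here
only to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Formats a long into buf with the matching %ld conversion.
 * Returns the number of characters written, or -1 if it did not fit. */
int format_long(char *buf, size_t size, long value)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "%ld", value);   /* %ld for long, never %d */
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Allocates count longs set to zero.  Returns NULL on failure, so the
 * caller has to check the result instead of assuming success. */
long *make_counters(size_t count)
{
    long *p;
    size_t i;

    if (count == 0 || count > (size_t)-1 / sizeof *p)
        return NULL;            /* size would be zero or would overflow */
    p = malloc(count * sizeof *p);
    if (p == NULL)
        return NULL;            /* malloc can fail: report it */
    for (i = 0; i < count; i++)
        p[i] = 0L;
    return p;
}
```

Neither function depends on int and long having the same size, which is
the assumption that breaks when such code moves to a 16-bit int machine.
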
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

Paul Schauble <Schauble@MIT-MULTICS.ARPA> (01/06/85)

I'm not sure that I should post this to the net, but I can't resist..

Henry Spencer, who seems to be one of the chief exponents of short
external names, just posted a convincing explanation of the need to not
break existing linkers.  I understand why and the issues involved.  I
even mostly agree.  In a previous incarnation I worked on COBOL and PL/1
for a manufacturer that had the same problem:  a language that required
long names and a linker that only handled short ones.

The solution that was used, and worked, was to have the COMPILER use the
external "name" to store a hashed value.  During the recent net
discussion I posted a description of this technique and some analysis of
the chance and cost of collisions.

This is done entirely in the compiler, and has no effect on the linker.

I have not seen any reasonable statement of why this would not be
workable.  The only objection that I can recall was that having to look
up the name translation during debugging was extra work.  True, but
consider...Would you rather have the extra work on the few occasions
that you need to look up a symbol on the load map, or on the many more
frequent occasions that you are dealing with C source and have to guess
what "dtfmdu" or something means?  You know which way I will vote.

More recent discussion prompts me to post a small modification of the
technique.  Several people have pointed out the desirability of a
language feature that would have the internal and external names of a
global item be different, e.g.

          extern int date_and_time() entry "SYS$TIME";
          extern int memory_size entry "CSYS$MEMSIZ";

I like this, other languages have it, it's useful, and it would have
saved me having to write a number of assembler routines whose only
purpose was to change names.

It also allows me to suggest a modification of the hashing technique.
Note that this only applies to systems with deficient linkers.

If the declaration contains an entry clause, use that as an external
name.

Otherwise, if the item name is short enough, use the item name.

Otherwise, hash the item name and use the result as the external name.

This allows programming using the full names, and using the entry clause
for those cases where you really care what the external name is, or in
the rare cases when the hash causes a duplication of external names.
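
The three rules above can be sketched as a single compiler-side function.
This is only an illustration under stated assumptions: the limit LINK_MAX
of 6, the function names, and the choice of hash (FNV-1a folded into six
letters) are all hypothetical, since the post does not say which hash the
original compilers used.  The letter encoding also assumes a character set
where 'A' to 'Z' are contiguous, as in ASCII.

```c
#include <assert.h>
#include <string.h>

#define LINK_MAX 6   /* assumed linker limit on external name length */

/* Hypothetical hash (32-bit FNV-1a); any well-mixed hash would do. */
static unsigned long hash_name(const char *s)
{
    unsigned long h = 2166136261UL;

    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h;
}

/* Chooses the external name for an item.  entry is the text of the
 * "entry" clause, or NULL if the declaration has none.
 * Returns 0 on success, -1 if the entry clause exceeds the limit. */
int external_name(const char *name, const char *entry,
                  char out[LINK_MAX + 1])
{
    unsigned long h;
    int i;

    assert(name != NULL && out != NULL);
    if (entry != NULL) {                /* rule 1: entry clause wins */
        if (strlen(entry) > LINK_MAX)
            return -1;
        strcpy(out, entry);
        return 0;
    }
    if (strlen(name) <= LINK_MAX) {     /* rule 2: short names unchanged */
        strcpy(out, name);
        return 0;
    }
    h = hash_name(name);                /* rule 3: hash the long name */
    for (i = 0; i < LINK_MAX; i++) {
        out[i] = (char)('A' + h % 26);  /* one base-26 digit per letter */
        h /= 26;
    }
    out[LINK_MAX] = '\0';
    return 0;
}
```

Every module compiled with the same function derives the same external
name for a given long name, which is what lets the linker stay unchanged.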

----------------------------------------------------------------------

Now, my questions:

   To the standards committee people:

1.  Suppose that the standard required longer names and suggested
    hashing as an implementation technique.  That would force
    manufacturers to update either linker or compiler to meet the
    standard.  Is this politically possible?

2.  In some other areas, I am told, the standard described a relatively
    high level language, rather than the minimum of implementations.
    This will prevent some present compilers from meeting the standard.
    Why should it pick the minimum here?

3.  How can I get a copy of the draft standard?

4.  Is this an adequate method of getting comments and questions to the
    committee? If not, what is a useful channel?

    To the net at large:

1.  What are specific objections to the hashing technique?

2.  Are there any machines where it won't work, and why?


Please copy me on any answers.  Service from the list has been erratic
lately.

          Thanks for all the fish...

          Paul
          Schauble@MIT-Multics.ARPA

henry@utzoo.UUCP (Henry Spencer) (01/08/85)

> Henry Spencer, who seems to be one of the chief exponents of short
> external names, just posted a convincing explanation of the need to not
> break existing linkers. ...

To rebut a misconception:  I don't like short external names.  I merely
think that (a) some provision for them in the standard is inevitable,
and (b) annoying though this is, we can live with it, which is a passing
grade for a standard that has to apply to everyone.

> [A] solution that was used, and worked, was to have the COMPILER use the
> external "name" to store a hashed value.  During the recent net
> discussion I posted a description of this technique and some analysis of
> the chance and cost of collisions.

I don't recall seeing the previous posting about this, but the problem of
collisions is definitely a nasty one.  Bearing in mind that separately-
compiled modules must agree on the object-file (i.e. short) name under
which an identifier is known, the possibility of collisions is a major
flaw in a hashing scheme.  I've worked with compilers that did similar
things (first 4 and last 3 chars of the identifier, as I recall) and one
had to be careful about collisions; it really wasn't much better than
short identifiers.  If the algorithm used is really a hashing function
rather than a systematic "cut and paste" rearrangement of the original
identifier, collisions become (a) less likely, and (b) harder to spot and
deal with.
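
A "cut and paste" scheme of the kind recalled above (first 4 and last 3
characters) is easy to sketch, and the sketch shows why collisions stay
easy to spot but also easy to cause.  The function name squeeze_4_3 is
hypothetical, and the exact rule is an assumption based on the
recollection in the text.

```c
#include <assert.h>
#include <string.h>

/* Reduces name to at most 7 characters: names of 7 or fewer characters
 * are copied unchanged, longer ones become their first 4 characters
 * followed by their last 3.  out must hold 8 bytes. */
void squeeze_4_3(const char *name, char out[8])
{
    size_t len;

    assert(name != NULL && out != NULL);
    len = strlen(name);
    if (len <= 7) {
        strcpy(out, name);
        return;
    }
    memcpy(out, name, 4);                /* first 4 characters */
    memcpy(out + 4, name + len - 3, 3);  /* last 3 characters */
    out[7] = '\0';
}
```

With this rule read_input_line and read_output_line both become readine,
so a common prefix plus a common suffix is enough to collide.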

Note that hashing *demands* a way to force an internal-to-external
correspondence, like the proposed "entry" clause, for linking to system
services and other languages.

I like the idea of using an "entry" clause to manage correspondences
between internal/long and external/short names, although if you ignore
the issue of identifiers containing funny characters, you can do exactly
the same thing with #define.  (Note that preprocessor identifiers are
internal, hence must be long.)
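
The #define approach can be sketched as follows.  The short spellings
dattim and memsiz, and the idea of keeping the mapping in one shared
header, are assumptions made for illustration.

```c
#include <assert.h>

/* Mapping from long source names to short external names.  In practice
 * this would live in one header included by every module, so that all
 * of them agree on the short spellings. */
#define date_and_time         dattim
#define memory_size_in_bytes  memsiz

long memory_size_in_bytes = 65536L;   /* the linker sees "memsiz" */

long date_and_time(void)              /* the linker sees "dattim" */
{
    return 0L;                        /* placeholder value */
}

long memory_size_in_kbytes(void)      /* long name, left unmapped here */
{
    assert(memory_size_in_bytes >= 0);
    return memory_size_in_bytes / 1024;
}
```

The preprocessor replaces the long names before the compiler proper sees
them, so only the short names reach the object file.  This cannot produce
external names containing characters that are illegal in C identifiers,
which is the case the "entry" clause would still be needed for.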

I am not a member of the committee, but will comment on some of the
suggestions addressed to them...

> 1.  Suppose that the standard required longer names and suggested
>     hashing as an implementation technique.  That would force
>     manufacturers to update either linker or compiler to meet the
>     standard.  Is this politically possible?

I don't know.  If the problem of collisions can be shown to be a non-issue,
and the "entry" clause or something like it can be introduced, it might be
viable.  It depends on how manufacturers feel about hashing.

> 2.  In some other areas, I am told, the standard described a relatively
>     high level language, rather than the minimum of implementations.
>     This will prevent some present compilers from meeting the standard.
>     Why should it pick the minimum here?

Because the problems go much farther than the compiler.  Object-module
formats are visible system-wide, making changes much harder.

> 3.  How can I get a copy of the draft standard?

I believe the draft has gone to ANSI for publication for formal public
comment; it should be available from CBEMA (don't have the address handy)
shortly.  The price will be unpleasant, though, knowing CBEMA.  I don't
know whether the older informal channels are still open.

> 4.  Is this an adequate method of getting comments and questions to the
>     committee? If not, what is a useful channel?

Some of the committee folks definitely do read this newsgroup.  If you
want to be forceful about something, though, the recommended course is
to write (on a piece of paper) to them.  The transition to ANSI formal-
public-comment phase may have altered this, though.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jans@mako.UUCP (Jan Steinman) (01/08/85)

In article <6951@brl-tgr.ARPA> cottrell@nbs-vms.ARPA writes:
>...personally, i would like to see variable names restricted to 3 chars
>exactly :-)

Or better yet, a single upper-case character, ('I' preferred) followed by
exactly four digits! ::-)
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

Paul Schauble <Schauble@MIT-MULTICS.ARPA> (01/09/85)

One other comment on the hashing technique:  When I made the original
posting I assumed the linker model I was most familiar with:  one
external definition and a series of references.  For this model, having
two C symbols that hash to the same external is not very much of a
problem.  The linker will see two different definitions of its symbol
and should complain.

The numbers given also assumed the shortest linker name space I was
aware of, 30 bits.  For anything larger (GCOS, 36 bits; MS-DOS, 64
bits), the probability of collision is too small to compute on the
equipment I have at hand.
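
The size of the risk can be estimated with the usual birthday
calculation.  This is a sketch that assumes the hash spreads names
uniformly over the whole name space; the function name
collision_probability is hypothetical, and the figures from the earlier
posting are not reproduced here.

```c
#include <assert.h>

/* Probability that at least two of n hashed names share a value, for a
 * hash assumed to be uniform over 2^bits values.  Computed as one minus
 * the product of (1 - i / 2^bits) for i = 1 .. n-1. */
double collision_probability(unsigned long n, unsigned bits)
{
    double space = 1.0;
    double none = 1.0;
    unsigned long i;
    unsigned b;

    assert(bits > 0 && bits < 1024);
    for (b = 0; b < bits; b++)
        space *= 2.0;               /* 2^bits, without needing <math.h> */
    if ((double)n > space)
        return 1.0;                 /* more names than values: certain */
    for (i = 1; i < n; i++)
        none *= 1.0 - (double)i / space;  /* name i+1 misses the first i */
    return 1.0 - none;
}
```

For 1000 long names in a 30-bit space this comes to about 4.7e-4, close
to the approximation n(n-1)/2N.  For a 64-bit space each factor rounds to
exactly 1.0 in double precision, so the function returns 0: the true
value is not zero, merely below what this arithmetic can resolve.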

          Paul

bright@dataio.UUCP (Walter Bright) (01/09/85)

> The solution that was used, and worked, was to have the COMPILER use the
> external "name" to store a hashed value.  During the recent net
> discussion I posted a description of this technique and some analysis of
> the chance and cost of collisions.
> 
> This is done entirely in the compiler, and has no effect on the linker.
> 
>     To the net at large:
> 
> 1.  What are specific objections to the hashing technique?

a)	Reading linker maps would be terrible.
b)	All the other tools that depend on the global symbol table
	would be messed up. So, you say, rewrite the tools so they
	inverse hash the symbols. So then it would be easier to just
	fix the linker, and we're back where we started.

	My solution to the external symbol dilemma is that it should
	be implementation-defined, since the behavior is determined
	by the linker and the compiler writer typically has no
	control over the linker.

	If code is being ported to a machine with a smaller linker,
	the programmer could 'hash' the overly long externals himself
	with macros.

wdr@faron.UUCP (William D. Ricker) (01/10/85)

>From: Paul Schauble <Schauble@MIT-MULTICS.ARPA>

>.
>:
>The solution that was used, and worked, was to have the COMPILER use the
>external "name" to store a hashed value.  During the recent net
>discussion I posted a description of this technique and some analysis of
>the chance and cost of collisions.

>This is done entirely in the compiler, and has no effect on the linker.

>.
>:
>More recent discussion prompts me to post a small modification of the
>technique.  Several people have pointed out the desirability of a
>language feature that would have the internal and external names of a
>global item be different,  ...

>If the declaration contains an entry clause, use that as an external
>name.
>Otherwise, if the item name is short enough, use the item name.
>Otherwise, hash the item name and use the result as the external name.

-----------

One interpretative language I'm familiar with uses a similar hashing
scheme.  (This ties in with the suggestion of 7chars & length, as was
used in PL/I.) The length, initial three characters, and a hash-code of
1-31 character identifiers were used as the internal name.  In the
special case of length=4, the hash-code is the fourth character, also
compressed.

In reality, the length and 4 characters are compressed into 4 bytes.
This is possible due to the limited character set for identifiers.  The
interpreter unpacked the length, initials and hash when the structure
was displayed (in debugging or listing what routines were loaded).  It
even altered the format to distinguish visually between

	4 frob

	7 fro b

to emphasize that "frob" hashes itself, "4frob", but "frobble" hashes to
"7frob".

I'm not sure what the character set was, nor bit assignments.  (I could
look it up at home if anyone cares.) It might have been 5 bits for the
length (1-31) and 6 bits for the compressed initials and hash--but it
wasn't the SIXBIT character set.  For some reason, I think the number
of bits for the 3-char compression (perfect hash) was not divisible by
three, though, and the 4th char/hash was compressed separately.  17
bits would represent 128k combinations, which would represent 3
characters from a 50-character font; 16 bits suffices for 3 alphamerics
(40-character font: [A-Z0-9@$#_]).
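
The arithmetic behind the 16-bit figure can be sketched directly: three
characters from a 40-character set give 40*40*40 = 64000 combinations,
which fits below 65536.  The ordering of the character set and the
function names pack3 and unpack3 are assumptions; the interpreter's real
encoding is not known from the text.

```c
#include <assert.h>
#include <string.h>

/* Assumed ordering of the 40-character set [A-Z0-9@$#_]. */
static const char CHARSET[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@$#_";

/* Packs exactly three characters into a value in 0..63999 (radix 40).
 * Returns -1 if s is too short or has a character outside the set. */
long pack3(const char *s)
{
    long v = 0;
    int i;

    assert(s != NULL);
    for (i = 0; i < 3; i++) {
        const char *p;

        if (s[i] == '\0')
            return -1;                   /* fewer than three characters */
        p = strchr(CHARSET, s[i]);
        if (p == NULL)
            return -1;                   /* character not in the set */
        v = v * 40 + (long)(p - CHARSET);
    }
    return v;
}

/* Inverse of pack3: recovers the three characters.  out holds 4 bytes. */
void unpack3(long v, char out[4])
{
    int i;

    assert(v >= 0 && v < 64000L);
    for (i = 2; i >= 0; i--) {
        out[i] = CHARSET[v % 40];
        v /= 40;
    }
    out[3] = '\0';
}
```

The same reasoning gives the 17-bit case: 50*50*50 = 125000 combinations,
which fits below 131072.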


-- 

  William Ricker
  wdr@MITRE-Bedford.ARPA					(MIL)
  wdr@faron.UUCP						(UUCP)
  decvax!genrad!linus!faron!wdr					(UUCP)
 {allegra,ihnp4,utzoo,philabs,uw-beaver}!linus!faron!wdr	(UUCP)

Opinions are my own and not necessarily anyone else's.  Likewise the "facts".

mike@enmasse.UUCP (Mike Schloss) (01/11/85)

> One other comment on the hashing technique:  When I made the original
> posting I assumed the linker model I was most familiar with:  one
> external definition and a series of references.  For this model, having
> two C symbols that hash to the same external is not very much of a
> problem.  The linker will see two different definitions of its symbol
> and should complain.

This will work fine for one object module and one or more libraries,
but what about multiple object modules??? Like when you compile a
kernel, shell, or other large (multi source file) utility.

> The numbers given also assumed the shortest linker name space I was
> aware of, 30 bits.  For anything larger (GCOS, 36 bits; MS-DOS, 64
> bits), the probability of collision is too small to compute on the
> equipment I have at hand.

The probability is too small...  Does this mean it will never occur?
Would you like to find the bug that an unreported collision will
cause if (when) it does happen?  Or would this be the first place
you would look if you did have an elusive bug?

P.S.  Assuming hashing is used...
A possible solution to finding this rare bug would be to recompile
everything (libraries and all) using an alternate hashing function.

mike@enmasse.UUCP (Mike Schloss) (01/11/85)

> >...personally, i would like to see variable names restricted to 3 chars
> >exactly :-)
> 
> Or better yet, a single upper-case character, ('I' preferred) followed by
> exactly four digits! ::-)

Or how about a single upper-case character only for numbers and
a '$' followed by a single upper-case character only for characters?

jack@vu44.UUCP (Jack Jansen) (01/13/85)

>> >...personally, i would like to see variable names restricted to 3 chars
>> >exactly :-)
>> Or better yet, a single upper-case character, ('I' preferred) followed by
>> exactly four digits! ::-)
>Or how about a single upper-case character only for numbers and
>a '$' followed by a single upper-case character only for characters?
Yeah! This sounds great! And let's add *mandatory* line numbers, so
it will be much simpler to discuss programs! And let's call this
discussion "Beauty And Style Into 'C'!!".
\
 \
:-)
 /
/
-- 
	Jack Jansen, {seismo|philabs|decvax}!mcvax!vu44!jack
	or				       ...!vu44!htsa!jack
Help! How do I make a cup of tee while working with an acoustic modem?

Schauble@MIT-MULTICS.ARPA (Paul Schauble) (11/06/86)

A couple of months back, I was involved in a fairly active tirade about
the length of external names in the C standard.  I believed then, and
still do, that the proposed standard's length of 8 characters is
inadequate.  This minimum will become a maximum for anyone wanting to
write portable code.

Now, I don't want to reopen the argument here.  I am very curious,
however, as to why that limit was established.  The only reason I can
come up with is to accommodate limitations in somebody's linker.  But
who?

The last machine I am aware of that had a short name restriction in the
linker was Honeywell's GCOS line.  They now have a new linker with a 500
character limit.

I have reason to suspect that there are no current machines and
operating systems with a very short limit.  The reason is that the COBOL
standard requires 30 character names, and that forced most manufacturers
to update their linkers.

So, I am asking for information.  Are there any current production
machines and operating systems with a linker that will not accept 30
character external names?

By current production I mean one that is actively supported by new
software, such that one could reasonably expect it to get an ANSI C
compiler.

Please reply directly to me.  I will post results in two weeks.  If you
know of such a machine, please provide me my counterexample.

          Thanks,
          Paul
          Schauble at MIT-Multics.ARPA