osd7@homxa.UUCP (Orlando Sotomayor-Diaz) (03/11/85)
From: Orlando Sotomayor-Diaz (The Moderator) <cbosgd!std-c>

mod.std.c Digest        Mon, 11 Mar 85        Volume 4 : Issue 12

Today's Topics:
        Character escapes for deficient terminals
        Long external identifiers
        preprocessor issues by Larry Rosler
----------------------------------------------------------------------

Date: Sat, 9 Mar 85 23:55 EST
From: Paul Schauble <ucbvax!Schauble@MIT-MULTICS.ARPA>
Subject: Character escapes for deficient terminals
To: cbosgd!std-c@BERKELEY

In a recent posting, Doug Moen tabulated the differences between the VAX
PCC C compiler and the proposed standard. One of his listings bothers me
badly:

> A set of trigraphs have been defined to allow C programs to be entered
> on terminals that don't support the full ASCII character set [B.2.1]:
>       ??=     #
>       ??(     [
>       ??/     \
>       ??)     ]
>       ??'     ^
>       ??<     {
>       ??!     |
>       ??>     }
>       ??-     ~
> These trigraphs are even interpreted within string constants,
> so that "??=" is identical to "#". [B.1.1.2]

Part of the botheration is that it would have broken the program I was
working on just before reading this. It used ??(programname) as the
prefix for all error messages.

Specifically, I really don't want to see yet another special set of
characters. We already have one escape character defined within strings,
\, let's just use that. Instead of the trigraphs, why not \( and so on?
Everybody already knows that \ is special.

          Paul

[ Some terminals don't have the \ key either. See the trigraph list
  above. -- Mod -- ]

------------------------------

Date: Sun, 10 Mar 85 01:02 EST
From: Paul Schauble <ucbvax!Schauble@MIT-MULTICS.ARPA>
Subject: Long external identifiers
To: cbosgd!std-c@BERKELEY

I don't believe that a 6 character limit is even close to acceptable.
Other people have given more than adequate reasons. I would like to
suggest a possible solution for those machines which haven't yet updated
their linker.

This note proposes an alternate method for dealing with the problem of
long (>6 character) external identifiers.
It is entirely contained within the C compiler and should work with any
linker.

Note that this discussion assumes a linker model with one external
definition and multiple external references. If the linker accepts
multiple external definitions (the labeled common model), some of the
numbers will change, but the basic argument will not.

First, one extension has been proposed on Info-C that I would like to
strongly second. That is to reactivate the "entry" keyword and use it to
allow the internal and external names of a global item to be different,
e.g.

        extern int date_and_time() entry "SYS$TIME";
        extern int memory_size entry "CSYS$MEMSIZ";

I like this, other languages have it, it's useful, and it would have
saved me having to write a number of assembler routines whose only
purpose was to change names.

Not having seen the proposed standard, I don't know what notation it
uses for the language. The intention here is that the clause

        identifier ENTRY "external name"

can appear following an identifier in any context where the identifier
would be visible outside of the current compilation. The identifier is
used within the current compilation and the contents of the quoted
string are used as the external name. No syntax is specified for the
quoted string, except that normal \ processing is performed. Of course,
if no entry clause is given, the external name is the same as the
internal.

This solves two problems. First, as above, it allows referencing
external names that contain oddball characters, such as $, !, and ), all
of which I have seen used. Second, it allows finessing the long external
problem:

        extern very_long_internal_function_name entry "flff"();

by separating the characteristics of internal and external identifiers.
The external restrictions apply only if the programmer chooses not to
use entry.

In summary, I believe that this feature offers enough that it should be
adopted on its own merits.
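
As an aside on the entry proposal above: the clause is not part of C, so
the following is only a rough stand-in, a minimal sketch assuming a
preprocessor that keeps long names distinct. A macro lets the source use
a long internal name while the object file carries a short one. Both
names are hypothetical (flff is borrowed from the example above).

```c
/* Sketch only: approximates the proposed `entry` clause with a macro.
 * After preprocessing, every use of the long name becomes "flff",
 * so "flff" is the only name the linker ever sees. */
#define very_long_internal_function_name flff

/* Written and called under the long, readable name. */
int very_long_internal_function_name(int x)
{
    return x + 1;
}
```

Unlike the entry clause, this cannot produce external names containing
characters such as $ that are not legal in C identifiers, and the short
name still has to be chosen by hand.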
Now, go one step further and adopt these rules for choosing the external
name:

  - If the declaration contains an entry clause, that gives the external
    name.
  - Otherwise, if the internal name meets all of the requirements for
    external names, use the internal name.
  - Otherwise, generate an external name by hashing the internal name.
    The hash function should be one that considers every character in
    the internal name, not e.g. take the first 3 and last 3.

This hash function will map long identifiers onto whatever name space
the linker provides. This mapping will be a many to one mapping, but if
the linker name space is at all reasonable and the number of externals
reasonably few, collisions would be rare. Have the compiler hash each
external and use the hashed value as the linker name.

Collisions between two names within a module can be found by the
compiler and fixed by rehashing the name. Collisions between names in
two different modules will not be caught by the compiler. These will
result in duplicate definitions of the external, and should be caught by
the linker. The only undetected error is to have an external reference
in a module set that is missing the external definition (already an
error) such that the reference hashes to the same name as another
definition.

I assume that the compiler can be made to print a list of C language
name to linker name and vice versa so that the programmer can read
linker maps. It's extra work, but that's what you get for using an
ancient linker. It's less work than trying to remember what "dtmfdu"
means everywhere in your C code. Note that this technique has been used
successfully by several other compilers.

This way you won't tie the entire future of the C language to those
machines left over from the 60's. Please don't do that. This solution
allows those with good linkers to use them, those with poor linkers to
limp along, and provides encouragement to improve the linker breed.
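
The three naming rules above can be sketched in C as follows. This is an
illustration under stated assumptions, not any compiler's actual scheme:
the 40-character alphabet, the 6-character limit, the function names,
and the choice of FNV-1a as the hash are all picked here for the
example. The last function gives the usual pairwise estimate of how
often two of n hashed names would collide in a space of 40^6 names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINK_NAME_LEN 6
#define LINK_RADIX    40
#define LINK_SPACE    4096000000ULL   /* 40^6 possible linker names */

/* Assumed linker alphabet (40 symbols); a real linker defines its own. */
static const char LINK_ALPHABET[LINK_RADIX + 1] =
    "abcdefghijklmnopqrstuvwxyz0123456789_$.%";

/* FNV-1a over the whole name, so every character affects the result.
 * `attempt` is mixed in so the compiler can rehash after a collision. */
static uint64_t hash_name(const char *name, unsigned attempt)
{
    uint64_t h = 14695981039346656037ULL;      /* FNV-1a offset basis */
    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 1099511628211ULL;                 /* FNV-1a prime */
    }
    h ^= attempt;
    h *= 1099511628211ULL;
    return h;
}

/* Rule 2 check: is the internal name already a legal linker name? */
static int fits_linker(const char *name)
{
    size_t len = strlen(name);
    if (len == 0 || len > LINK_NAME_LEN)
        return 0;
    return strspn(name, LINK_ALPHABET) == len;
}

/* Rule 3: encode the hash as six symbols of the linker alphabet. */
static void hashed_link_name(const char *internal, unsigned attempt,
                             char out[LINK_NAME_LEN + 1])
{
    uint64_t v = hash_name(internal, attempt) % LINK_SPACE;
    for (int i = LINK_NAME_LEN - 1; i >= 0; i--) {
        out[i] = LINK_ALPHABET[v % LINK_RADIX];
        v /= LINK_RADIX;
    }
    out[LINK_NAME_LEN] = '\0';
}

/* Applies the rules in order; `entry` is NULL when no clause was given. */
static void choose_link_name(const char *internal, const char *entry,
                             unsigned attempt, char *out, size_t outsz)
{
    if (entry != NULL) {
        snprintf(out, outsz, "%s", entry);          /* rule 1 */
    } else if (fits_linker(internal)) {
        snprintf(out, outsz, "%s", internal);       /* rule 2 */
    } else {
        char hashed[LINK_NAME_LEN + 1];
        hashed_link_name(internal, attempt, hashed);
        snprintf(out, outsz, "%s", hashed);         /* rule 3 */
    }
}

/* Expected number of colliding pairs among n uniformly hashed names. */
static double expected_collisions(unsigned n)
{
    return (double)n * (n - 1) / 2.0 / (double)LINK_SPACE;
}
```

A compiler using something like this would call choose_link_name with
attempt 0 and bump the attempt only when two names in the same module
hash alike. For 100 externals, expected_collisions comes out a little
over one in a million.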
A little analysis:

The most primitive linker I can immediately think of coded the names
into a 32 bit word, using 6 characters from an alphabet of 40. (Isn't
this what the PDP11 does, rather than radix 50??) This provides a name
space of roughly 4.3e9 linker names. Given a good hashing function, a
module set containing 100 externals has 1.2 chances per million such
SETS of having a DETECTED collision. Having an undetected collision, as
noted above, first requires a programming error.

Assume that the useful lifetime of such machines is twenty years beyond
the publication of the standard (seems like plenty). Assume that new and
different module sets as described above are developed for machines as
described above using standard compilers at the rate of one such set per
calendar day. Over twenty years, that's 7300 module sets. We can expect
that .0084 of these contain detected collisions.

Assume that it takes 100 programmer-hours at $50/hour to repair each of
these collisions. That's $5,000/collision, or an expected $42 over the
twenty years. That's the expected cost to the programming community of
using this scheme. I am ignoring the cost of adding the hashing to the
compilers since the number of C compilers << number of C programmers.

Now, the benefits of the longer names are more readable source. This
shows up in faster program development and faster and easier
maintenance. Let's say that having longer externals that are the same as
internal labels saves 1% of programmer time (low, I think).

Conclusion: Since 1% of $4200 is the $42 expected cost, it seems to me
that the standard should mandate long externals if it can be shown that
the universe covered by the standard will spend more than $4200 on C
program development and maintenance over the next 20 years.

----------------

Perhaps I have gone off of the deep end here. I think not. Of course, I
welcome any logical rebuttals or comments.

          Paul Schauble
          Schauble@MIT-Multics

P.S.
One drawback of short externals that I haven't heard pointed out: If
external names have the same specs as internal, then it is possible to
change an item to an external just by changing the declaration. This
seems to happen fairly often when redesigning an existing system and
wanting to save code. You can't do this if the externals have to be
shorter than internals.

Also, I would like to second the viewpoint that whatever length is
allowed, the entire name should be significant. Allowing longer names
and ignoring the extra characters, in my experience, is very dangerous
and the source of some very subtle and nasty bugs.

------------------------------

Date: Sun, 10 Mar 85 13:38:45 pst
From: ucbvax!ucsfcgl!arnold (Ken Arnold)
Subject: preprocessor issues by Larry Rosler
To: @ucbvax.BERKELEY:cbosgd!std-c

>The scanning of strings for embedded identifiers was something the
>committee simply could not accept. A string is a token, and what
>is inside a string has no grammar from which an identifier can be
>derived in any definable way. Once again the committee had to be
>convinced that substituting macro arguments in strings was worthwhile,
>and then an acceptable method had to be INVENTED.

A practical comment: The statement that an identifier can't be found in
a string in any definable way is simply false. I can write a regular
expression which will define the identifiers inside (or outside) a
string. A regular expression is quite specific and well defined. It is a
very clear, defined method of explication. Thus, finding identifiers in
strings, or defining what this means, is not a problem.

Looking for them is not a problem either. The preprocessor part of the
compiler (embedded or not) knows when it is expanding a macro and when
it is not. When it is expanding a macro with parameters, it can look in
a string for identifiers defined as above. This is neither difficult nor
theoretically unclear.
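
The scan described above might look roughly like this in C. It is a
sketch, not the behavior of any particular preprocessor: the function
name and interface are made up for the example, the identifier pattern
is taken to be [A-Za-z_][A-Za-z0-9_]*, and skipping backslash escapes
(so that the n in \n is never mistaken for a parameter named n) is an
added assumption.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_ident_start(int c) { return isalpha(c) || c == '_'; }
static int is_ident_char(int c)  { return isalnum(c) || c == '_'; }

/* Copies the body of a string literal from src to dst, replacing each
 * whole identifier equal to `param` with `arg`. Returns the number of
 * substitutions, or -1 if dst is too small. */
int subst_in_string(const char *src, const char *param, const char *arg,
                    char *dst, size_t dstsz)
{
    size_t plen = strlen(param), alen = strlen(arg), used = 0;
    int count = 0;

    if (dstsz == 0)
        return -1;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        const char *out = src;      /* text to emit for this chunk */
        size_t n = 1, outlen;       /* n: source characters consumed */

        if (c == '\\' && src[1] != '\0') {
            n = 2;                  /* keep escapes such as \n intact */
        } else if (is_ident_char(c)) {
            /* Take the maximal run of letters, digits and underscores. */
            while (is_ident_char((unsigned char)src[n]))
                n++;
        }
        outlen = n;
        /* Only a run that starts like an identifier can match. */
        if (is_ident_start(c) && n == plen && memcmp(src, param, plen) == 0) {
            out = arg;
            outlen = alen;
            count++;
        }
        if (used + outlen + 1 > dstsz)
            return -1;
        memcpy(dst + used, out, outlen);
        used += outlen;
        src += n;
    }
    dst[used] = '\0';
    return count;
}
```

Given a hypothetical parameter x with argument a+b, the string body
"x = %d" becomes "a+b = %d", while x1 and max are left alone because
only whole identifiers match.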
I fail to see the committee's problem with adding this feature, which
exists in many implementations.

I agree with the comments about comments and white space.

The C standard's extension is not evil, and I will admit that it is
somewhat more powerful, but the two methods are not mutually exclusive.

          Ken Arnold
=================================================================
Of COURSE we can implement your algorithm. We've got this Turing
machine emulator...

------------------------------

End of mod.std.c Digest - Mon, 11 Mar 85 12:08:32 EST
******************************
USENET -> posting only through cbosgd!std-c.
ARPA -> ... through cbosgd!std-c@BERKELEY.ARPA (NOT to INFO-C)
In all cases, you may also reply to the author(s) above.