[gnu.gcc] Is your system polluted?

rfg@ICS.UCI.EDU (12/22/89)

As part of the work I'm doing on protoize/unprotoize, I decided that it would
be a good idea to be able to find out (for any given system) what the
names of all of the functions declared in system include files are.
I wrote the following script to do part of the job.

The results that I got from running this script on one system are very
saddening.  It appears that (for some systems at least) there is an awful
lot of pollution of various name spaces contained in the system include
files.  Specifically, there are lots of clashes of names where one name
is used for two (or more) different things in two (or more) different
include files.  This means that you may/will get errors if particular
pairs of include files are included into the same base file. :-(
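
To make the kind of clash concrete, here is a minimal single-file sketch.
The headers "a.h" and "b.h" and every name in them are hypothetical and
are inlined so the example stands alone; they are not taken from any real
system. The clashing declaration is guarded by a made-up SHOW_CLASH macro
so that the file compiles by default.

```c
/* Sketch only: "a.h" and "b.h" are hypothetical headers, inlined here
   so that the example is a single file. */

/* ---- contents of hypothetical a.h ---- */
typedef int status;              /* here 'status' names a type */
status a_open(void);

/* ---- contents of hypothetical b.h ---- */
#ifdef SHOW_CLASH
extern long status;              /* same name, now an object: this clashes */
#endif
long b_status(void);

/* ---- definitions, so the sketch links on its own ---- */
status a_open(void) { return 0; }
long b_status(void) { return 1L; }
```

Compiled as is, the file is accepted. Compiled with -DSHOW_CLASH, the
second declaration of 'status' should be rejected, which is the sort of
error that appears when two such headers meet in one base file.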

For those of you who may want to know how polluted your own system's name
space is, I suggest that you try to run the following script and see what
happens.  I would be very interested to hear about results for various
systems.

All this script does is to make a big .c file that includes all of your
system's include files.  It then tries to compile the whole batch with gcc.
Now it seems that this ought to be an acceptable thing to do, but my
prediction is that (on most systems) many errors will arise during the
compilation.

Note that I have hacked out all of the filenames that I had to put into the
definition of the symbol DELETIONS in order to get the compile to complete
without error.  I did this to protect the guilty.

// rfg


#!/bin/sh

# Many include files require other include files to be included first.
# The following is a list of such files.  This list may need to be
# tailored (i.e. added to, or reordered) for your particular system.
# There is generally no need to delete entries that don't apply to your
# particular system because if any one of the entries doesn't exist on
# your system, it will simply be ignored.

LEADERS="/usr/include/sys/types.h \
	/usr/include/rpc/types.h \
	/usr/include/rpc/auth.h \
	/usr/include/rpc/xdr.h \
	/usr/include/sys/socket.h \
	/usr/include/ufs/vnode.h \
	/usr/include/sys/ipc.h \
	/usr/include/rpc/clnt.h \
	/usr/include/rpcsvc/yp_prot.h \
	/usr/include/pwd.h \
	/usr/include/limits.h \
	/usr/include/utmp.h \
	/usr/include/stdio.h \
	/usr/include/regexp.h"

# Note that for well behaved systems, it should be possible to include all of
# the include files which reside beneath /usr/include into one single file.

# If this is not true on your system, then that's definitely a problem.
# You should probably get either your sysadmin or your vendor to fix
# up your system include files.  Otherwise, you can include the names
# of offending files in the list of files *not* to be included here.

DELETIONS=""

# Some files may require odd things to be defined before they can be included.
# For example, regexp.h needs to have ERROR and INIT both defined to null
# string.

# If your include files need specific things defined, specify them here.

DEFINES="-DINIT= \
	-DERROR="


# Don't change anything below this line.
##############################################################################

LOCAL_INCLUDES=`find /usr/include -type f -name \*.h -print`

rm -f temp-include.c temp-include.s
rm -f leaders.h followers.h

echo '#include "leaders.h"' > temp-include.c
echo '#include "followers.h"' >> temp-include.c

# Find all the leader files that actually exist on this system, and write them
# (in the order given above) into the leaders.h file.

for leader_file in $LEADERS; do
	for include_file in $LOCAL_INCLUDES; do
		if [ "$include_file" = "$leader_file" ] ; then
			echo '#include "'$include_file'"' >> leaders.h
			break
		fi
	done
done

# Find all the non-leaders, and write them (in proper order) into the
# followers.h file.

for include_file in $LOCAL_INCLUDES; do
	IS_LEADER=0
	for leader_file in $LEADERS; do
		if [ "$include_file" = "$leader_file" ] ; then
			IS_LEADER=1
			break
		fi
	done
	if [ $IS_LEADER -eq 0 ] ; then
		echo '#include "'$include_file'"' >> followers.h
	fi
done

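# kludge around problem files by not including them

# Start from an empty sed script.  Without this, an empty DELETIONS list
# leaves sed.script missing, sed fails, and followers.h gets overwritten
# with an empty file.
: > sed.script
for deletion in $DELETIONS; do
	fixed=`echo $deletion | sed 's%/%\\\\/%g'`
	echo /$fixed/d >> sed.script
done
sed -f sed.script followers.h > /tmp/followers.h
cp /tmp/followers.h followers.h
rm -f sed.script /tmp/followers.h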

gcc $DEFINES -S temp-include.c
rm -f temp-include.c temp-include.s
rm -f leaders.h followers.h

pcg@aber-cs.UUCP (Piercarlo Grandi) (12/23/89)

In article <8912211630.aa04575@ICS.UCI.EDU> rfg@ICS.UCI.EDU writes:
    
    As part of the work I'm doing on protoize/unprotoize, I decided that it would
    be a good idea to be able to find out (for any given system) what the
    names of all of the functions declared in system include files are.
    I wrote the following script to do part of the job.
    
    The results that I got from running this script on one system are very
    saddening.  It appears that (for some systems at least) there is an awful
    lot of pollution of various name spaces contained in the system include
    files.  Specifically, there are lots of clashes of names where one name
    is used for two (or more) different things in two (or more) different
    include files.  This means that you may/will get errors if particular
    pairs of include files are included into the same base file. :-(

Actually things are even worse than Ron Guilmette says. Not only do a lot
of second rate hackers put duplicate names in system headers, but they do
the following things as well:

	1) internal kernel entities are declared in headers for application
	use. A very bad offender here is System V.3.2; some BSD versions
	at least make an attempt to bracket these within #ifdef KERNEL
	#endif (which is still unsatisfactory).

	2) a more generic problem is that a lot of user level packages
	also declare in their headers entities that are only used
	internally by the package.

	3) even worse, a lot of libraries contain externals that are not
	declared static. This is very dangerous, because you may unwittingly
	use the same name in your program, and then all hell breaks loose. A
	particularly bad offender is curses.

In C++ this is less troublesome as you can stuff things within the walls of
a class, and their scope will then be local to it. Except for typedefs,
unfortunately, but at least C++ 2.0 allows encapsulation of enums (and class
names, but that is virtually unavoidable).

In C, where we don't have a proper modularization facility, the following
guidelines ought to be followed:

	1) All global entities declared by a module should start with a well
	advertised module prefix, including #defines, procedures, variables,
	enums, structs, typedefs,... This has already been partially done with
	existing libraries, e.g. for prefixes 'str', 'f' (stdio), 'w'
	(curses), but usually in a half baked way. As a solution it is not
	complete, in that you may then have clashes of prefixes, but at
	least the problem becomes an order of magnitude less severe. In C++
	this is done by putting as much as possible within class boundaries.

	2) File names should also start with the module's prefix, both
	headers and sources. Such names can be either of the form
	<prefix><suffix>.h (e.g. StreamIn.h, StreamOut.h, StreamRw.h) or
	<prefix>/<suffix>.h (e.g. Inet/Udp.h, Inet/Tcp.h, ...), depending
	usually on their number (or the length of the name under System V).

	3) Published headers should contain only the client interface of a
	module. Actually, for sophisticated modules, the client interface
	should be split in several headers, each containing only a subset
	of entities likely to be used together. Eschew all-inclusive header
	files (e.g. like "builtin.h" in libg++).

	4) The internal interfaces of a module should be in a separate set
	of headers that is not published.  For example, my tree library has
	two headers, "Tree.h" and "Tree/Own.h", and the latter contains the
	declarations of utility entities used by the other sources in the
	library, and is not published. Splitting the header is better than
	bracketing with #ifdef KERNEL #endif.

	5) Under Unix, published headers ought to be in /usr/include if they
	are for modules implemented at the user level, /usr/include/sys if
	they are for kernel level modules. Internal interfaces ought not to
	be in either; they ought to be in /usr/sys/h or the directory that
	holds the module sources, e.g. /usr/src/lib/libc. If there are
	multiple headers, according to rule 2),

	6) All file global entities internal to a module should be declared
	static. If they cannot, because the module is split in several
	source files, then respecting rule 1 is absolutely essential.
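
As a rough illustration of rules 1 and 6, here is a single-file sketch of
a small module. The "tree_"/"TREE_" prefix and all of the names below are
hypothetical; they are not taken from the tree library mentioned in
rule 4. The comments mark which part would go in the published header and
which part would stay in the module's source.

```c
#include <stddef.h>

/* ---- would live in the published header, e.g. "Tree.h" ---- */
#define TREE_MAX_DEPTH 64                 /* macros carry the prefix too */
enum tree_order { TREE_PREORDER, TREE_INORDER };
struct tree_node {
    int key;
    struct tree_node *left, *right;
};
struct tree_node *tree_insert(struct tree_node *root, struct tree_node *node);
size_t tree_count(const struct tree_node *root);

/* ---- would live in the module's source file ---- */

/* Internal helper: static, so its name never reaches the linker's
   global name space (rule 6).  Returns the empty slot where a node
   with the given key belongs. */
static struct tree_node **tree_find_slot(struct tree_node **slot, int key)
{
    while (*slot != NULL)
        slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    return slot;
}

/* Insert a caller-owned node and return the (possibly new) root. */
struct tree_node *tree_insert(struct tree_node *root, struct tree_node *node)
{
    node->left = node->right = NULL;
    *tree_find_slot(&root, node->key) = node;
    return root;
}

/* Count the nodes reachable from root. */
size_t tree_count(const struct tree_node *root)
{
    return root == NULL ? 0 : 1 + tree_count(root->left) + tree_count(root->right);
}
```

A client sees only prefixed names, so a clash needs a second module that
picked the same prefix, not merely the same word.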

Naturally all these rules are palliatives; what we should really have, and
given C, C++, and Unix and other similar operating systems, we will not
have, is a tree of symbol tables. The best way to have this is to have an
object store, like in RSRE Flex or Cambridge CAP, or some Lisp machines or
systems, but this is wishful thinking... Second best would be something like
Multics, as usual.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

rfg@ics.uci.edu (Ron Guilmette) (12/23/89)

In article <1552@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>Actually things are even worse than Ron Guilmette says...

I know, but I didn't want to scare people.

>of second rate hackers put duplicate names in system headers, but they do
>the following things as well:
>
[stuff deleted]
>
>	3) even worse, a lot of libraries contain externals that are not
>	declared static. This is very dangerous, because you may unwittingly
>	use the same name in your program, and then all hell breaks loose. A
>	particularly bad offender is curses.

Since people generally seem to be so lazy about this particular aspect
of "good" coding, I was thinking of suggesting a -fdefault-static option
for GCC which would make the default linkage (or "storage-class", as you
prefer) in the absence of an explicit specification "static" rather than
"extern".  This could even be useful for old code because you could compile
a given system with it, and then try to link.  The linker would tell you
which items ought to be explicitly declared as extern, and you could then
go and fix *just* those declarations up to be explicitly extern and recompile
again with -fdefault-static, thereby minimizing extern visible symbols.
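
The -fdefault-static option is only a proposal here, so the following is
a sketch of the end state that the compile/link/fix-up cycle would aim
for, written with explicit storage classes so that it compiles with any C
compiler. The file name "percent.c" and all identifiers are hypothetical.

```c
/* Hypothetical file "percent.c" as it might look after the fix-up pass. */

static int percent_calls;               /* private: no other file uses it */

static int clamp(int v, int lo, int hi) /* private helper */
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Suppose the link step reported these two as undefined in other
   files; they alone are marked extern. */
extern int percent_clamp(int v)
{
    percent_calls++;
    return clamp(v, 0, 100);
}

extern int percent_call_count(void)
{
    return percent_calls;
}
```

Only percent_clamp and percent_call_count remain visible to the linker,
so a generic name like clamp cannot collide with a clamp defined in some
other file or library.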

>In C, where we don't have a proper modularization facility, the following
>guidelines ought to be followed:
>
[stuff deleted]
>
>	2) File names should also start with the module's prefix...

Too late.  ANSI C mandates several include file names which do not
follow this rule.

>Naturally all these rules are palliatives; what we should really have...

What we should really do is to start all over, but I'd rather not. :-)

// rfg

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (12/26/89)

In article <259323F0.15070@paris.ics.uci.edu> rfg@ics.uci.edu (Ron Guilmette) writes:

   Since people generally seem to be so lazy about this particular aspect
   of "good" coding, I was thinking of suggesting a -fdefault-static option
   for GCC which would make the default linkage (or "storage-class", as you
   prefer) in the absence of an explicit specification "static" rather than
   "extern".  This could even be useful for old code because you could compile
   a given system with it, and then try to link.  The linker would tell you
   which items ought to be explicitly declared as extern, and you could then
   go and fix *just* those declarations up to be explicitly extern and recompile
   again with -fdefault-static, thereby minimizing extern visible symbols.

I agree 100%. I agree so much that I can then propose the
equivalently safe, but almost 100% backward trick that makes the
ridiculous volatile keyword useless with virtually no loss in
optimization ability and with absolute safety:

have an option to make "register" the *default* storage class for
block local variables. The only variables that need to be
explicitly declared "auto" are then those whose address is taken,
and the compiler will without any problem flag them out for you.
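
A small sketch of what this would look like, with the storage classes
spelled out by hand since the proposed option does not exist. The
function names are hypothetical.

```c
/* Locals written as if "register" were the default storage class. */
int sum_to(int n)
{
    register int i;
    register int total = 0;

    for (i = 1; i <= n; i++)
        total += i;
    /* Writing &total here would be rejected by the compiler: the
       address of a register variable cannot be taken. */
    return total;
}

/* Helper that needs a pointer to its result location. */
static void store(int *out, int v)
{
    *out = v;
}

int via_pointer(int v)
{
    auto int slot = 0;   /* address is taken below, so it must be "auto" */

    store(&slot, v);
    return slot;
}
```

The only local that needs an explicit storage class is slot, and a
forgotten "auto" shows up as a compile-time error at the & operator
rather than as a run-time bug.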

If you have instead the equivalent trick of having variables
unvolatile by default, you need to manually tag as volatile those
that need be, and if you don't, you get nasty bugs.

I reckon that an option to disable volatile and make register the
default storage class for locals would provide virtually all the
benefits as to optimization (caching globals is virtually
irrelevant), without any risk, and would make it very easy to
develop or upgrade existing programs that relied on Classic C
semantics (e.g. the Unix kernel) in a multithreaded environment.

Consider GNU C under Mach...

--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

brooks@maddog.llnl.gov (Eugene Brooks) (12/26/89)

In article <PCG.89Dec25215018@rupert.cs.aber.ac.uk> pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:
>I agree 100%. I agree so much that then I can propose the
>equivalently safe, but almost 100% backward trick that makes the
>ridiculous volatile keyword useless with virtually no loss in
>optimization ability and with absolute safety:
The volatile keyword is not ridiculous, and it is very useful.
Its original domain, to support device register access where memory
values would change in a spontaneous way, has expanded to shared memory
multiprocessing.  I am sure that the ANSI committee did not have multiprocessing
in mind when they hatched volatile, but they did us a big favor with it.
It is best not to statically declare a variable as volatile, however;
it is best to declare a specific reference volatile with a cast when
you need to be sure that the compiler does not screw you.
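
One way to write such a per-reference cast is sketched below. The macro
VOLATILE_READ, the variable ready_flag and the function poll_ready are
hypothetical names used only for illustration. The cast makes the
compiler perform the load on each evaluation; by itself it is not a
guarantee of atomicity or of ordering between processors.

```c
/* Hypothetical helper: read an int through a volatile-qualified lvalue,
   so that this particular reference is not cached in a register. */
#define VOLATILE_READ(x) (*(volatile int *)&(x))

int ready_flag;          /* ordinary, non-volatile declaration */

/* Spin until ready_flag becomes nonzero or max_spins is reached.
   Returns the number of spins performed. */
int poll_ready(int max_spins)
{
    int spins = 0;

    while (!VOLATILE_READ(ready_flag) && spins < max_spins)
        spins++;
    return spins;
}
```

Other references to ready_flag elsewhere in the program stay ordinary
and remain open to optimization; only the polling read pays the cost.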

brooks@maddog.llnl.gov, brooks@maddog.uucp