[comp.lang.c] Question about linking files

bobmon@iuvax.cs.indiana.edu (RAMontante) (03/26/89)

Hey, gurus and standards freekz...

A question is going around comp.sys.ibm.pc just now, to wit:  somebody has
noticed that (TurboC's) linker links all the routines in a file into his
object file, whether a particular routine is actually called or not.  This
bothers him, because the resulting executable is larger than it really needs
to be.

Among the flurry of LIBrarian tricks to defeat this, a couple of people are
saying that this isn't (TurboC/MSC/whoever)'s fault, because C requires that
all routines (all symbols, maybe?) in a file be linked in if any of them are.

Are there any nuggets of correctness in all this?  Would someone care to
shine some light on it all?

(disclaimer:  I may have oversimplified the discussion.  If you have a good
answer, it's probably worth crossposting to comp.sys.ibm.pc.)

chris@mimsy.UUCP (Chris Torek) (03/26/89)

In article <18925@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu
(RAMontante) writes:
>... a couple of people are saying that this isn't (TurboC/MSC/whoever)'s
>fault, because C requires that all routines (all symbols, maybe?) in a
>file be linked in if any of them are.

Given that the pANS does not have the concept of a `library', or
even of `separate compilation', this is clearly false.  It is, however,
difficult to tell which of several code and/or data sections may
be required.  Consider, for instance, the following:

	#include <stdio.h>

	static void a(), b();
	static void go(void (**tab)(), int n);
	static void (*table[2])() = { a, b };

	void entry_point(int n) { go(&table[0], n); }

	static void go(void (**tab)(), int n) {
		(*tab[n])();	/* this calls either a() or b() */
	}

	static void a() { (void) printf("a called\n"); }
	static void b() { (void) printf("b called\n"); }

It is not possible to tell, at compile time, which of `a' and `b' will
be called.  If `n' is deleted from entry_point(), and we call `go' with
0, b() can be elided.  Discovering this is quite difficult.  More
generally, if the link format uses offsets to locations that can be
resolved at compile time (such as from entry_point() to go(), if the
machine supports pc-relative calls), there may be insufficient
information in the object files.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

jacobs%cmos.utah.edu@wasatch.UUCP (Steven R. Jacobs) (03/26/89)

In article <16541@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <18925@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu
>(RAMontante) writes:
>>... a couple of people are saying that this isn't (TurboC/MSC/whoever)'s
>>fault, because C requires that all routines (all symbols, maybe?) in a
>>file be linked in if any of them are.
>
>Given that the pANS does not have the concept of a `library', or
>even of `separate compilation', this is clearly false.  It is, however,
>difficult to tell which of several code and/or data sections may
>be required.  Consider, for instance, the following:
>
>	#include <stdio.h>
>
>	static void a(), b();
>	static void go(void (**tab)(), int n);
>	static void (*table[2])() = { a, b };
>
>	void entry_point(int n) { go(&table[0], n); }
>
>	static void go(void (**tab)(), int n) {
>		(*tab[n])();	/* this calls either a() or b() */
>	}
>
>	static void a() { (void) printf("a called\n"); }
>	static void b() { (void) printf("b called\n"); }
>
>It is not possible to tell, at compile time, which of `a' and `b' will
>be called.  If `n' is deleted from entry_point(), and we call `go' with
>0, b() can be elided.  Discovering this is quite difficult.

Yes, but suppose the following (common) situation occurs:

	extern void a(), b(), c(), d(); /* NOTE no longer static */
	static void (*table[2])() = { a, b }; /* c() and d() not used here */
	/* extra lines omitted */

and in a different file:

	void a() { (void) printf("a called\n"); }
	void b() { (void) printf("b called\n"); }
	void c() { (void) printf("c called\n"); }
	void d() { (void) printf("d called\n"); }

This situation must not be too hard to detect, since lint will give
"function defined but not used" messages in this case.  Admittedly,
this might not be an appropriate thing for the linker to handle,
but it would sure be nice if the librarian would detect such cases
and treat them as if they were compiled from separate files, at least
when no variables of "file-only" scope are involved.  Static functions
that are not used could be completely eliminated, and library functions
that are similar could be conveniently grouped into a single source file.
I find it easier to manage libraries of 20,000 lines of code when they are
in a few dozen files of a few hundred lines each as opposed to hundreds of
files, many of which contain similar functions that are only 5 to 10 lines
of code.  The limitations of present linkers/librarians force my programs
to be larger than they need to be, or force me to deal with hundreds of
source files.


Steve Jacobs  ({ihnp4,decvax}!utah-cs!jacobs, jacobs@cs.utah.edu)

chris@mimsy.UUCP (Chris Torek) (03/27/89)

In article <16541@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>Given that the pANS does not have the concept of a `library', or
>even of `separate compilation', ...

I should probably rephrase that.  It does have something called
`external linkage'; it just does not tie it specifically to `separate
compilation' and `libraries'.  (The difference is that between what
must be and what usually is.)

I should also restate my point, which is this: You cannot tell which
functions are needed---consider the program

	main()
	{
		while (the_machine_continues_to_exist())
			/* void */;
		/* never gets here */
		library_function_f();
		exit(0);
	}

---so the best you can do is an approximation (`the function strftime
will never be called; the function strcpy might be called; ...').
Unfortunately, unless the link file format has been carefully defined
and the compiler cooperates, you cannot even do that:

	_foo:	.globl	_foo
		.word	0
		movl	$_foo+foosize,r0
		calls	$0,(r0)
		ret
		.align	2
	0:	.set	foosize,0b-_foo

The VAX-assembly-code function foo() calls whichever function is
linked immediately following it, so eliding that function because it
appears unused changes the execution.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gwyn@smoke.BRL.MIL (Doug Gwyn ) (03/27/89)

In article <16541@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>Given that the pANS does not have the concept of a `library', or
>even of `separate compilation', ...

The pANS does recognize the notion of library and separate compilation;
see Section 2.1.1.1.

According to the pANS, a program consists of a set of translation units
linked together and communicating by well-defined "external" interfaces.
Nowhere in the pANS (that I could find) is there any idea that only a
portion of a translation unit might be linked into a program.  The
means by which available translation units are selected for linking
together into programs is not within the scope of the standard,
although the usual link-editing of multiple object modules with others
selected from libraries to satisfy external references is clearly among
the methods envisioned.

I would say that any link process which dropped a portion of an object
module (presumed to be produced from a single translation unit) would
be non-standard conforming (unless the dropped portion had no detectable
effect on the final program).

jfc@athena.mit.edu (John F Carr) (03/27/89)

In article <16546@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>Given that the pANS does not have the concept of a `library', or
>>even of `separate compilation', ...

>The VAX-assembly-code function foo() [deleted] calls whichever function is
>linked immediately following it, so eliding that function because it
>appears unused changes the execution.

Does the standard have anything to say about linking to programs written
in other languages, or even compiled by different compilers?  Would a 
compiler & environment (i.e. linker) that loaded by C function instead of
file (and therefore broke the deleted example) be conforming?  Assume
that this hypothetical compiler works correctly on all C programs.

A more important problem is this:  there are at least two strategies for
passing structures to and from functions.  One is to pass the structure
on the stack, the other is for the caller to pass a pointer.  Modules
compiled using different methods will not work together.  Does the
standard offer any guidance in this case?  As long as not all compilers are
bug-free, there will be reasons to use different compilers on parts of the
same program.  

(My guess at the answer to the above questions: "the standard can not attempt
to define behavior when different compilers are used for different source
files, or when interacting with languages other than standard C.")

--
   John Carr             "When they turn the pages of history,
   jfc@Athena.mit.edu     When these days have passed long ago,
   bloom-beacon!          Will they read of us with sadness
   athena.mit.edu!jfc     For the seeds that we let grow?"  --Neil Peart

henry@utzoo.uucp (Henry Spencer) (03/28/89)

In article <10126@bloom-beacon.MIT.EDU> jfc@athena.mit.edu (John F Carr) writes:
>(My guess at the answer to the above questions: "the standard can not attempt
>to define behavior when different compilers are used for different source
>files, or when interacting with languages other than standard C.")

Right.  It is not guaranteed that it will even be possible.  (Plausible
example:  an interpretive implementation might not support such things
at all.)  These are "quality of implementation" issues.
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

Tim_CDC_Roberts@cup.portal.com (03/28/89)

I think that this discussion of external linkages has missed the point of
the original poster, although I could be missing the point too.

If I read the original correctly, he is saying that given:

   MY.LIB  <==  A.OBJ  entry points A1  A2  A3  externals C1
                B.OBJ  entry points B1  B2  B3  
                C.OBJ  entry points C1  C2  C3

and given

   main () { a1();  a3(); }

then the Turbo C linker will include ALL 3 modules in the resulting
executable, whereas it is plain that B.OBJ is not required.  If this
is, in fact, the case, then the Turbo C linker is _broken_.  The Microsoft
Linker will include only A.OBJ and C.OBJ in the executable.

Now, if A, B, and C are all modules on a single OBJ, and that OBJ is fed
to the linker, then one would expect all three to appear on the executable.
I don't think that was the question, however.

Tim_CDC_Roberts@cup.portal.com                | Control Data...
...!sun!portal!cup.portal.com!tim_cdc_roberts |   ...or it will control you.

bobmon@iuvax.cs.indiana.edu (RAMontante) (03/28/89)

Tim_CDC_Roberts@cup.portal.com <16315@cup.portal.com> :  [ condensed ]
-
-If I read the original correctly, he is saying that given:
-
-   MY.LIB  <==  A.OBJ  entry points A1  A2  A3  externals C1
-                B.OBJ  entry points B1  B2  B3  
-                C.OBJ  entry points C1  C2  C3
-
-   main () { a1();  a3(); }
-
-	[ ... ]
-
-Now, if A, B, and C are all modules on a single OBJ, and that OBJ is fed
-to the linker, then one would expect all three to appear on the executable.


I didn't mean all entries in a LIBrary, I meant all modules in the same
original OBJ.  In fact the simplest "fix" is to make a library out of it.
The question was:  is such behavior (linking everything in the OBJ)
necessary for some reason, or is it more likely to be a hack for
speed/simplicity of compilation (or a bug)?

Sorry I wasn't clear the first time.

Devin_E_Ben-Hur@cup.portal.com (03/29/89)

> I think that this discussion of external linkages has missed the point of
> the original poster, although I could be missing the point too.
> 
> If I read the original correctly, he is saying that given:
> 
>    MY.LIB  <==  A.OBJ  entry points A1  A2  A3  externals C1
>                 B.OBJ  entry points B1  B2  B3  
>                 C.OBJ  entry points C1  C2  C3
> 
Nope, the original poster made no mention of libraries.  He wished the linker
to treat: LINK A+B+C,P.EXE;  as if all the functions in A,B,&C were
independently compiled then made into a library and linked only if referenced.

> and given
> 
>    main () { a1();  a3(); }
> 
> then the Turbo C linker will include ALL 3 modules in the resulting
> executable, whereas it is plain that B.OBJ is not required.  If this
> is, in fact, the case, then the Turbo C linker is _broken_.  The Microsoft
> Linker will include only A.OBJ and C.OBJ in the executable.
> 
The Turbo linker will perform just like the uSoft linker for this.  Even if
it did include B.OBJ, it would not be _broken_, merely an inferior
implementation.  A broken linker produces an incorrect program; including
b.obj does not make the output incorrect, merely larger than necessary.

> Now, if A, B, and C are all modules on a single OBJ, and that OBJ is fed
> to the linker, then one would expect all three to appear on the executable.
> I don't think that was the question, however.
> 
> Tim_CDC_Roberts@cup.portal.com                | Control Data...
> ...!sun!portal!cup.portal.com!tim_cdc_roberts |   ...or it will control you.

Devin_Ben-Hur@Cup.Portal.Com
...ucbvax!sun!portal!cup.portal.com!devin_ben-hur

bright@Data-IO.COM (Walter Bright) (03/30/89)

In article <18980@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu (RAMontante) writes:
>The question was:  is such behavior (linking everything in the OBJ)
>necessary for some reason, or is it more likely to be a hack for
>speed/simplicity of compilation (or a bug)?

It is expected behavior, that if you specify a .OBJ file to the linker,
it'll link it in.

Suppose, for example, you create a C file that has only

	static char copyright[] = "Copyright (C) by XYZ Corp";

in it, and you want that string embedded in the resulting EXE file.
This file would be compiled and then placed in the list of OBJs to be
linked together. If the linker ignored it, because it didn't satisfy any
unresolved externals, then that is a BUG.

The order that OBJs are specified to the linker is also important.

The purpose of library files is to link in only the object files necessary
to resolve any remaining undefined externals.

On a related issue, the structure of an OBJ file closely follows that of
an assembly language source file, i.e. it is *not* organized as a sequence
of functions. OBJ files are a sequence of bytes, with public and
external symbols. What the function boundaries are, or even if the bytes
represent code or data, is irrelevant to the format of the OBJ file.

Expecting object files to have more structure to them is nice for the
future, but for now and for compatibility with existing practice, it's
impractical.

This lack of structure in the OBJ file is a major obstacle when creating
a symbolic debugger. So everyone who does a symbolic debugger has invented
extensions to the format in order to add structure. Unfortunately, the
problems are:
1. This is only added if symbolic debug info is requested.
2. It adds quite a bit to the size of the file, slowing down linking.
3. Microsoft and Borland have decided to keep their formats secret, thus
   doing a major disservice to the community. (Let's hear it for open
   standards!)

P.S. My comments apply to OBJ files on MS-DOS, I'm not familiar with
COFF.

dg@lakart.UUCP (David Goodenough) (04/14/89)

Guts of argument:

file1:		proc1() { proc3(); }

file2:		proc2() { }

file3:		proc3() { }

N.B. file2 is not necessary to resolve inclusion of file1.

The question: Should inclusion of file2 on the command line cause
inclusion of the code for proc2, even though it is not needed to
resolve any undefined labels?

The answer: (IMHO) Yes.

There is a difference between object files (UNIX .o) and Libraries (UNIX .a).
ALL stuff in a .o should be included because it may be needed to resolve
a forward reference. When I write programs, I can produce just 14K of
object from 20 source files (OK I'm using Z80 assembler, but the principle
still holds true in any environment), and I have external references all
over hell's half acre. Now I don't want any damn linker trying to second
guess what I mean. As far as I know, the L80 linker ( father of the MS-DOS
mess????? ) had a /S option to search:

So if I said:

L80 FILE1,FILE2,FILE3/S .....

FILE1.REL and FILE2.REL would be linked in their entirety, needed or not,
but FILE3 would be searched, and only used to resolve current undefined
labels. The linker I use now (ZLINK) does it, but automatically, based on
the filename extension: .O for mandatory linkage, and .L for libraries.
Also the internal format of a .L file is a little different, but that's
another story.
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com		  	  +---+