[mod.computers.vax] main

LEICHTER-JERRY@YALE.ARPA.UUCP (06/18/86)

Reply-To: <LEICHTER-JERRY@YALE.ARPA>

    >The use of this "first function will be transfer address" feature should
    >probably be avoided - it's non-portable (though I'd guess it's in VAX C
    >exactly because other implementations did this - Unix is one, in fact -
    >and programs came to rely on it).
    >							-- Jerry
    
    Don't badmouth UNIX so quickly.  Here's what happens in SysV:
    
    $ cat foo.c
    foo()
    {
    	printf("hello, world!\n");
    }
    $ cc -o foo foo.c
    undefined		first referenced
     symbol  		    in file
    main                		/lib/crt0.o
    ld fatal: Symbol referencing errors. No output written to foo
    
    The interpretation of this is that the UNIX linker was looking for a
    main() to satisfy the reference by the startup routines contained in the
    object module /lib/crt0.o (there is a _start() routine in /lib/crt0.o
    which the linker treats as a hardwired "transfer address" and which calls
    main()).  The linker did not find main() so it barfed.  Simple, no?

When I discovered that VAX C made the first routine the entry point by
default, I figured there was SOME reason for going to the trouble, so I
pulled out my 4.2bsd documentation and checked out ld.  It says this:

	[In the command description]  The entry point of the output is the
	beginning of the first routine (unless the -e option is specified).

	[The description of -e]  The following argument is taken to be the
	name of the entry point of the loaded program; location 0 is the
	default.

Note that these two statements are not obviously consistent, and, in fact,
had better disagree on any system without something like separate I&D spaces
(else the "first routine" would have to be at 0, hence its address would be
at 0 - but that's NULL!)  Nor do they account for main().  I see no reference
at all to _start().

Since you mentioned System V, I checked some AT&T 3B2 ld documentation I have
here.  The only references to the entry point are as follows:

	[The description of -e epsym]:  Set the default entry point address
	for the output file to be that of symbol epsym.

	[Under Caveats]  When the link editor is called through cc(1), a
	startup routine is linked with the user's program.  [The startup
	routine arranges to call exit(), etc.; no mention of entry points.]

Again, no reference to _start() - or to what the entry point would be if
-e were left out.  The documentation of the cc command doesn't say either.
But note that your "simple" example, and hardwired entry point, are apparently
NOT ld's doing, but cc's!

At this point, I got curious to see what the reality of the situation was.  So
I tried your little foo program out on our local Celerity (4.2bsd).  "cc foo"
produces "Undefined:  _main", and running the resulting a.out produces an
immediate "Invalid address".  However, a foo.o gets left around.  So I did an
"ld foo.o".  This led to "Undefined:  _printf".  Well, getting there.  I tried
"ld foo.o /lib/libc.a".  No errors!  Running a.out produces "Hello world",
followed by an access violation.  Adding an explicit exit(0) fixes that nicely.
In fact, the resulting program even receives its command line arguments
properly!  (Well, it gets argc; I didn't bother to check argv, but it's pretty
certain to be correct - both are coming from the Shell's exec.)  My 3B2 is
down at the moment so I have no System V implementation to try it on, but are
you still going to bet that the first routine WON'T end up as the entry point?

The thing that's so wonderful about Unix is its portability.  And consistency.
And documentation.  And, of course, the legions of Unix users who can be
counted on to view ANY non-laudatory mention of Unix as "badmouthing" it.

How is the comment - right or wrong - that Unix makes the default entry point
the first routine "badmouthing" Unix?  At worst, I was claiming that a lot of
non-portable C code got written under Unix (since K&R certainly contains
nothing to indicate that there can be an entry point other than main()).  And
if you don't believe THAT, then you haven't looked at much Unix code.

							-- Jerry
-------

jso@edison.UUCP (John Owens) (06/23/86)

--  So I pulled out my 4.2bsd documentation and checked out ld.  It says this:
--  
--  	[In the command description]  The entry point of the output is the
--  	beginning of the first routine (unless the -e option is specified).
--  
--  	[The description of -e]  The following argument is taken to be the
--  	name of the entry point of the loaded program; location 0 is the
--  	default.
--  
--  Note that these two statements are not obviously consistent [....]
--  the "first routine" would have to be at 0, hence its address would be
--  at 0 - but that's NULL!)  Nor do they account for main().  I see no 
--  reference at all to _start().

Your 4.2bsd documentation set (if you got it from Berkeley) refers to
VAXen only, where the first routine *is* at location 0; the loader
just loads sequentially.  (C does *not* guarantee that address 0
doesn't contain anything, but that's another discussion.)  The loader
itself knows nothing about main or start; those are features of C, and
have nothing to do with any other language that ld might load.  You
just might be writing in assembler....

--  Since you mentioned System V, I checked some AT&T 3B2 ld documentation I
--  have here.  The only references to the entry point are as follows:
--				[....]
--  Again, no reference to _start() - or to what the entry point would be if
--  -e were left out.  The documentation of the cc command doesn't say either.
--  But note that your "simple" example, and hardwired entry point, are
--  apparently NOT ld's doing, but cc's!

The documentation seems to be lacking here.  (I've never been very
fond of ATTIS's rewritten documentation.)  The definition of the C
language really does require that main be the starting point; I
suppose that didn't need to be part of the man page.  start is not a
user-visible feature, and certainly doesn't have to have that name.
The hardwired entry point is a ld feature; the reference to main a
feature of C.  Nonetheless, the entry point will still be the first
routine.  Read on....

--  I tried your little foo program out on our local Celerity (4.2bsd).
--  "cc foo" produces "Undefined: _main", and running the resulting a.out
--  produces an immediate "Invalid address".  However, a foo.o gets left
--  around.  So I did an "ld foo.o".  This led to "Undefined: _printf".

When cc invokes ld, it looks something like this:
	/bin/ld /lib/crt0.o foo.o -lc		[-X flags and such left off]
The crt0 file is loaded at address 0, and refers to main and exit.
foo.o must satisfy main, and /lib/libc.a will satisfy printf and exit.

--  Well, getting there.  I tried
--  "ld foo.o /lib/libc.a".  No errors!  Running a.out produces "Hello world",
--  followed by an access violation.  Adding an explicit exit(0) fixes that.

This is certainly not supported.  You were lucky.  It's dependent on
the implementation of the exec(2) system call whether or not you'll
get your command line arguments this way.

--  [...] but are you
--  still going to bet that the first routine WON'T end up as the entry point?

I won't bet on anything if the loader isn't invoked properly....

--  At worst, I was claiming that a lot of non-portable C code got written
--  under Unix (since K&R certainly contains nothing to indicate that
--  there can be an entry point other than main()).  And if you don't
--  believe THAT, then you haven't looked at much Unix code.

That code you've been looking at is going to have a hard time being
ported to most UNIX systems then, much less any other system with a C
compiler.  I've been porting, adapting, and randomly mangling C code
for UNIX from a variety of sources for years, and haven't run into a
single program that doesn't have an entry point of main().  Would you
refer me to such a program that I might have access to, like something
from USENET, a USENIX tape, or a System V or BSD distribution?

--  							-- Jerry

	John Owens @ General Electric Company
	edison!jso%virginia@CSNet-Relay.ARPA		[old arpa]
	edison!jso@virginia.EDU				[w/ nameservers]
	jso@edison.UUCP					[w/ uucp domains]
	{cbosgd allegra ncsu xanth}!uvacs!edison!jso	[roll your own]

LEICHTER-JERRY@YALE.ARPA (06/26/86)

To: John Owens <edison!jso%virginia.csnet@CSNET-RELAY.ARPA>
In-Reply-To: John Owens <edison!jso%virginia.csnet@CSNET-RELAY.ARPA>, Mon, 23 Jun 86 09:34:42 edt

In general, I agree with what you say.  A couple of small comments:

    C does *not* guarantee that address 0 doesn't contain anything, but
    that's another discussion.
C DOES guarantee that the integer constant 0, cast to any pointer type, will
never be equal to a pointer to any actual object of that type.  In principle,
the cast could change the bit pattern; it almost never does - certainly it
does not on a VAX.  Thus, _start == NULL.  Most users will never see this,
but an implementer of _start() would.  (Minor point, but the fact is there IS
an inconsistency - nothing keeps you from doing an extern void _start() and
looking at the resulting pointer.)
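
The guarantee itself can be checked directly.  What follows is a minimal
sketch; some_object, points_at_object and check_null_guarantee are names
invented for illustration.  It says nothing about what lives at machine
address 0, only that a pointer to a real object never compares equal to the
null pointer.

```c
#include <assert.h>
#include <stddef.h>

static int some_object;   /* hypothetical object, for illustration only */

/* Returns 1 if p refers to a real object, i.e. is not the null pointer. */
int points_at_object(const int *p)
{
    return p != NULL;
}

/* Self-check: the constant 0 cast to a pointer type is the null pointer,
   and the address of an actual object never compares equal to it. */
void check_null_guarantee(void)
{
    assert((int *)0 == NULL);
    assert(points_at_object(&some_object));
    assert(!points_at_object((int *)0));
}
```

Calling check_null_guarantee() from any main() should pass silently on a
conforming implementation, whatever bit pattern the null pointer has.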

    --  Well, getting there.  I tried
    --  "ld foo.o /lib/libc.a".  No errors!  Running a.out produces "Hello
    --  world", followed by an access violation.  Adding an explicit exit(0)
    --  fixes that.
    
    This is certainly not supported.  You were lucky.  It's dependent on
    the implementation of the exec(2) system call whether or not you'll
    get your command line arguments this way.
Actually, I've since been informed that, while argc is passed correctly, argv
is screwy and envp isn't there at all.
    
    --  [...] but are you
    --  still going to bet that the first routine WON'T end up as the entry
    --  point?
    
    I won't bet on anything if the loader isn't invoked properly....
That gets to the crux of things:  The "proper" way to invoke the loader is
undocumented - you must use cc.  How then do you deal with a program written
in multiple languages?  Basically, you ask a wizard....

I find it rather amusing that Unix, which (quite properly) argues for separate
modules with separate functions, and clean interfaces between them, glues the
loader and the C compiler together in a very ad hoc, undocumented way!  (Side
comment:  You at least understand what Unix is doing here.  I had a couple of
other correspondents on this issue who had no real idea what was going on, and
ended up effectively claiming that the loader really is part of the compiler.
If that's the case, (a) it's going to be very hard to deal with multiple
compilers, ever; (b) it becomes hard to justify why the loader doesn't do more
to help the compiler/user out - e.g., check for type clashes in external
function calls.  This would have been trivial to do if the implementers had
wanted to, with minimal overhead, and much faster than lint.  Yes, it would
have required additional facilities in C - argument definitions as in ANSI C -
but then the language, compiler, and loader were developed by the same people
at the same time.  As for those other correspondents, their lack of knowledge
didn't slow them down a bit in defending their incorrect religious
statements....)
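
For readers who have not seen ANSI-style argument definitions, here is a
small sketch of what they buy at compile time, even without any help from the
loader.  half_of and check_prototype_conversion are hypothetical names.  With
the prototype in scope the compiler converts the int argument to double; an
old-style call with no declaration in scope would pass an int to a function
expecting a double, and nothing would complain.

```c
#include <assert.h>

/* ANSI-style prototype: the argument type is part of the declaration. */
double half_of(double x);

double half_of(double x)
{
    return x / 2.0;
}

/* With the prototype visible, the int constant 3 is converted to 3.0
   before the call, so the result is exactly 1.5. */
void check_prototype_conversion(void)
{
    assert(half_of(3) == 1.5);
}
```

A loader-level check of the kind described above would need the same type
information carried into the object file; this sketch only shows the
source-level half.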

    --  At worst, I was claiming that a lot of non-portable C code got written
    --  under Unix (since K&R certainly contains nothing to indicate that
    --  there can be an entry point other than main()).  And if you don't
    --  believe THAT, then you haven't looked at much Unix code.
    
    That code you've been looking at is going to have a hard time being
    ported to most UNIX systems then, much less any other system with a C
    compiler.  I've been porting, adapting, and randomly mangling C code
    for UNIX from a variety of sources for years, and haven't run into a
    single program that doesn't have an entry point of main().  Would you
    refer me to such a program that I might have access to, like something
    from USENET, a USENIX tape, or a System V or BSD distribution?
If you read more closely what I said, you'll see that I didn't claim to have
any examples of this kind of thing...I just claimed that, somewhere out
there, they were likely to exist.  I know the people who did the VAX C
compiler and run-time support, and they've tried really hard to be compatible
with Unix.  Unfortunately, that can be very hard to do, since Unix programs
make use of a lot of undocumented "features".  For example:  There is
absolutely nothing in any definition of C that says that in:

	f(a,b)
	int a,b;
	{	int *x;

		x = &a + 1;
		...
	}

x will point to b.  In a field-test version of VAX C V2.0, this was NOT true.
(The VMS procedure-call spec says that the argument list is owned by the
CALLING procedure, which may place it in read-only memory, re-use it, etc.;
the CALLED procedure may only read it.  In that version of C, if you ever
took the address of a formal argument, the value passed was copied to a
temporary cell on entry, and the address you got was of the temporary.  As
far as documented C semantics are concerned, this is a completely correct
implementation - but it prevents you from screwing with the caller's argument
list.)  Anyway, cries of pain came from all over:  Despite the existence of
varargs - which WAS provided with that release, BTW - it turns out that there
are LOTS of C programs that assume you can scan through an argument list this
way.  So the final version of V2 put things back as they were, requiring a
waiver of conformance with this aspect of the procedure-call spec.  (As it
happens, VAX C (currently) always builds argument lists on the stack and then
discards them, so you can screw around to your heart's content - but try it
with a FORTRAN caller, and things get really weird....)
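
For comparison, the portable way to walk an argument list goes through the
variable-argument macros rather than address arithmetic on a formal.  The
sketch below uses the ANSI <stdarg.h> form, which is an assumption here (the
varargs of the releases discussed above would have been the older
<varargs.h>); sum_ints and check_sum_ints are made-up names.

```c
#include <assert.h>
#include <stdarg.h>

/* Adds up `count` int arguments without assuming anything about how
   the argument list is laid out in memory. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* Self-check: three arguments, 1 + 2 + 3. */
void check_sum_ints(void)
{
    assert(sum_ints(3, 1, 2, 3) == 6);
}
```

Because the macros hide where the arguments live, this version does not care
whether the callee sees the caller's list or a copy of it.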

Anyway, given that Unix programmers have historically grasped at ANYTHING they
can find the least justification for in the documentation - or no
justification at all - "compatibility" has to mean "put in EVERYTHING you can,
even if you can't think of anyone who's using it.  Someone will come along who
wants it, some day...."  Since the "entry point is the first routine" IS, in
fact, documented - even if only for wizards! - supporting it couldn't hurt....

    	John Owens @ General Electric Company
							-- Jerry
-------

bzs%bu-cs.bu.edu@CSNET-RELAY.ARPA.UUCP (06/29/86)

>From: LEICHTER-JERRY@YALE.ARPA
  ...an attempt to explain _start(), ld, C, null pointers etc...

There are so many horrendous mistakes in this article it would take
a month to straighten it out.

Suffice it to say I simply hope no one takes it seriously, he tries to
state things as if he knows what he is talking about, but he doesn't.

Perhaps he could post his article to INFO-C or net.lang.c and
find out how far off it is on almost everything.

Trust me folks, this is one to ignore.

	-Barry Shein, Boston University

[Must I have the energy to go point by point just to warn readers? no.]

kaiser@FURILO.DEC.COM (Systems Consultant) (07/01/86)

>>From: LEICHTER-JERRY@YALE.ARPA
>> ...an attempt to explain _start(), ld, C, null pointers etc...
>
>...
>Trust me folks, this is one to ignore.
>
>	-Barry Shein, Boston University
>
>[Must I have the energy to go point by point just to warn readers? no.]

Thank you, Barry, for your extremely factual and informative comment.  I'm
afraid you have the wrong answer for your final question, however, which falls
under the rule of "put up or shut up".  Otherwise it's just libel.

---Pete

Kaiser%furilo.dec@decwrl.dec.com
decwrl!furilo.dec.com!kaiser
DEC, 2 Iron Way (MRO3-3/G20), Marlboro MA 01752	 617-467-4445

LEICHTER-JERRY@YALE.ARPA.UUCP (07/01/86)

    >From: LEICHTER-JERRY@YALE.ARPA
      ...an attempt to explain _start(), ld, C, null pointers etc...

    There are so many horrendous mistakes in this article it would take
    a month to straighten it out.

    Suffice it to say I simply hope no one takes it seriously, he tries to
    state things as if he knows what he is talking about, but he doesn't.

    Perhaps he could post his article to INFO-C or net.lang.c and
    find out how far off it is on almost everything.

    Trust me folks, this is one to ignore.

    	-Barry Shein, Boston University

    [Must I have the energy to go point by point just to warn readers? no.]
I don't know about the energy; you seem to have plenty of that.  Things worth
saying, now, that's another issue.

When you've demonstrated any knowledge of anything related to this issue, I
might take you seriously.  But I have yet to see anything from you beyond the
typical "Unix is the solution, everything else is just the problem".

Grow up.
							-- Jerry
-------

jso@edison.UUCP (John Owens) (07/07/86)

I agree.  Certainly there are many UNIX programs that take advantage
of specific non-portable features!  I wish that more people would use
varargs, but old habits die hard.  varargs being a fairly recent
innovation, many people won't use it just because they're likely to
find more UNIX implementations that'll work the old way than those
with varargs, since everyone wants to stay compatible with existing
programs!

I finally went and read the start() routine, and was surprised to find
that it was written partially in C, with a few "asm" directives.  Many
other UNIX implementations, such as the PDP-11 ones, have this written
in assembler, with no public symbol for the first location.  (The name
_start comes from the fact that the C compiler prepends an underscore
to any external symbol.)

WRT mixing languages, I've found that just as cc will run "as" on any
assembler files, then invoke the loader appropriately, f77 will invoke
the c compiler on any .c files and the assembler on any .s files, etc.
The (admittedly ad-hoc) solution has usually been to have the
interface to your "new" language handle the linking with C modules,
and any others you know about.  Even without this, you can always do
something like:

	cc a.c b.c c.c
	f77 d.f e.f
	mod2 f.m g.m
 -- and then, if your main program is C, with Fortran and Modula-2 routines --
	cc a.o b.o c.o d.o e.o f.o g.o
 -- or if it's Fortran with C and Modula-2 routines --
	f77 a.o b.o c.o d.o e.o f.o g.o
 -- etc, since any interface should just pass .o files on to ld --
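
On the C side, a routine meant to be called from Fortran might look like the
sketch below.  It assumes the common Unix f77 conventions (a trailing
underscore appended to external names, arguments passed by reference, and
INTEGER matching int); none of these is stated in this thread, and addone_
and check_addone are hypothetical names.  Check the local f77 documentation
before relying on any of it.

```c
#include <assert.h>

/* Intended to be callable from Fortran as:  CALL ADDONE(N)
   Assumed convention: f77 appends '_' to the name and passes the
   address of N rather than its value. */
void addone_(int *n)
{
    *n += 1;
}

/* The same routine called from C, passing the address explicitly. */
void check_addone(void)
{
    int n = 41;

    addone_(&n);
    assert(n == 42);
}
```

Compiled to a .o file, a routine like this would simply be one more object on
the cc or f77 command lines shown above.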

One of the things I actually liked about VMS (really!  it does have a
few redeeming qualities) was that all the language interfaces were
extremely standardized, to the point that the main rtl was common to
all languages.  Of course it didn't have to be so cumbersome an
interface....

		In hope of universitality-
			-John
			(edison!jso%virginia@CSNET-RELAY.ARPA)

jsdy@hadron.UUCP.UUCP (07/15/86)

Newsgroups: net.lang.c
Subject: Re: Request for Comments
Summary: Combatting nonsense!

In article <870@bu-cs.UUCP> bzs@bu-cs.UUCP (Barry Shein) writes:
>Any comments? This is taken from INFO-VAX (mod.computers.vax):
>Path: bu-cs!harvard!caip!think!nike!styx!YALE.ARPA!LEICHTER-JERRY
>From: LEICHTER-JERRY@YALE.ARPA
>Subject: Re: main() and entry points in C
>Date: Thu, 26-Jun-86 08:50:52 EDT
>
>		...  Thus, _start == NULL.  Most users will never see this,
>but an implementer of _start() would.

If some implementation of C uses _start instead of start as an
entry point, they are asking for trouble.  I have never seen a
C-accessible symbol as an entry.  The entry point, start, used
to be almost always 0; but it isn't ever 0, now, in the latest
releases of AT&T System V Unix.

>    --  still going to bet that the first routine WON'T end up as the entry
>    --  point?
>    I won't bet on anything if the loader isn't invoked properly....
>That gets to the crux of things:  The "proper" way to invoke the loader is
>undocumented - you must use cc.  How then do you deal with a program written
>in multiple languages?  Basically, you ask a wizard....

On many versions of Unix, it is documented that properly written
multiple-language modules can be compiled together by appropriately
calling the compiler:
	cc myc.c myftn.o ...		or
	f77 myftn.f myc.o ...		or even
	f77 myftn.f myc.c ...

>		..  Since the "entry point is the first routine" IS, in fact,
>documented - even if only for wizards! - supporting it couldn't hurt....

You claim this is a documented undocumented feature?  It isn't
even true.  Specifically,
	never_called() { printf("Hi.  Joe is wrong.\n"); }
	main() { printf("Hello world.\n"); }
will never call me a liar.  Even if you are talking about
straight link-loaded objects (with no header to call _main()),
several common loaders allow one to specify entry points
other than 0.  These include, but are not! limited to, the
AT&T System V ld (again); the CULC Fortran IV Plus linker;
and the DEC VMS linker.

The C header, which cc puts at the head of the compiled objects,
contains the entry point -- which is not at location 0 under all
versions of Unix!  This header moves the argc, argv, and envp to
a location that main() will understand when called as a function;
then calls main() as a function; then calls exit() as a function
whose argument is main()'s return value; and if this doesn't exit,
typically tries to perform an exit system call itself, and then
(in desperation) a halt.
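
The sequence just described can be modelled in ordinary C.  This is only a
sketch: a real startup header is usually assembler (or C with asm directives)
and finds argc, argv and envp wherever exec left them.  startup_sketch,
user_main, record_exit and check_startup_sketch are hypothetical names; the
do_exit hook stands in for exit() so the model can be exercised without
terminating the process.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the user's main(). */
static int user_main(int argc, char **argv, char **envp)
{
    (void)envp;
    printf("argc=%d argv[0]=%s\n", argc, argv[0]);
    return 3;
}

static int last_exit_status = -1;

/* Stand-in for exit(): records the status instead of terminating. */
static void record_exit(int status)
{
    last_exit_status = status;
}

/* Model of the startup header: hand argc/argv/envp to main() as an
   ordinary function call, then pass main()'s return value to exit(). */
static void startup_sketch(int argc, char **argv, char **envp,
                           void (*do_exit)(int))
{
    do_exit(user_main(argc, argv, envp));
    /* A real header would fall back to an exit system call here. */
}

/* Self-check: main()'s return value becomes the exit status. */
void check_startup_sketch(void)
{
    char *argv[] = { "foo", NULL };
    char *envp[] = { NULL };

    startup_sketch(1, argv, envp, record_exit);
    assert(last_exit_status == 3);
}
```

The point of the model is the ordering: nothing in it depends on where the
header sits in memory, only on its being entered first.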

I should say it TYPICALLY does all this.  If it ain't documented
(and especially if it's not in the standard), it ain't guaranteed,
so don't bet the lunch money, Mildred.
-- 

	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
			jsdy@hadron.COM (not yet domainised)

LEICHTER-JERRY@YALE.ARPA.UUCP (07/18/86)

I received a private copy of Joe Yao's message, with no indication that it had
also been forwarded to info-vax.  The following is the response I sent him,
slightly amended.  For info-vax readers, there is some repetition here; my
apologies.

Unless someone brings up startling new facts, this is my last message on this
subject.  I promise. :-)
							-- Jerry

    If some implementation of C uses _start instead of start as an
    entry point, they are asking for trouble.  I have never seen a
    C-accessible symbol as an entry.  The entry point, start, used
    to be almost always 0; but it isn't ever 0, now, in the latest
    releases of AT&T System V Unix.
Try the Celerity 4.2bsd port - their entry point is BOTH start (no "_") and
something like _crt0_start.  The latter IS accessible from C - and has a
value of 0.  I'm told that this is not uncommon, although I can't name any
other examples.  I DO know that the Pyramid port - at least the 4.2bsd
universe version, I didn't try the System V one - uses start (no "_") only.
    
    On many versions of Unix, it is documented that properly written
    multiple-language modules can be compiled together by appropriately
    calling the compiler:
    	cc myc.c myftn.o ...		or
    	f77 myftn.f myc.o ...		or even
    	f77 myftn.f myc.c ...
"Many versions" of Unix?  I thought we were talking about one portable OS
here! :-)
    
    You claim this is a documented undocumented feature?  It isn't
    even true.  Specifically,
    	never_called() { printf("Hi.  Joe is wrong.\n"); }
    	main() { printf("Hello world.\n"); }
    will never call me a liar.  Even if you are talking about
    straight link-loaded objects (with no header to call _main()),
    several common loaders allow one to specify entry points
    other than 0.  These include, but are not! limited to, the
    AT&T System V ld (again); the CULC Fortran IV Plus linker;
    and the DEC VMS linker.
Mr. Shein forwarded only the last of a series of messages on this topic.  He
did this because he was interested in flames, not facts.  (Mr. Shein sent me a
note saying he would forward my message, "without editorial comment," to "the
C experts" on info-c/net.lang.c.  His interpretation of "without editorial
comment" allowed him to include a "summary line" of "Combating nonsense".  My,
my.  Mr. Yao's response is the first one, after about 2 weeks, that actually
talks about the issues - there were a couple of others pointing out related
bugs in various microprocessor C implementations.  There's been nothing
further from Mr. Shein.)

The whole discussion started with a question from a VMS C user as to why, if
he didn't provide a main() routine, his VMS C code started at the first
function seen by the VMS linker.  (Well, that wasn't QUITE his question -
others asked that; the original question shows up below.)  As my original
response pointed out, the Unix linker does exactly the same thing, if you
invoke it "straight"; it's only because of the way cc chooses to invoke the
linker that stuff works out.  If you do things the "supported" way in VAX C,
you get exactly the same results as if you do them the "supported" way on Unix
C.  But, in fact, even if you go beyond the "supported" approaches, the
results on VMS are a pretty close approximation to what you get on Unix.

    The C header, which cc puts at the head of the compiled objects,
    contains the entry point -- which is not at location 0 under all
    versions of Unix!  This header moves the argc, argv, and envp to
    a location that main() will understand when called as a function;
    then calls main() as a function; then calls exit() as a function
    whose argument is main()'s return value; and if this doesn't exit,
    typically tries to perform an exit system call itself, and then
    (in desperation) a halt.
Again, all of these points were brought up in messages that Mr. Shein chose to
omit.

Interestingly enough, the original message complained that, with previous
versions of VMS C, a program without a main(), called at its first function,
received its command line arguments - but that in recent versions this didn't
work "unlike Unix".  In fact, this does NOT work in Unix!

    I should say it TYPICALLY does all this.  If it ain't documented
    (and especially if it's not in the standard), it ain't guaranteed,
    so don't bet the lunch money, Mildred.
K&R defines the semantics of C programs.  They provide a definition for how
C programs start up (in main()).  They also provide a definition for how
functions get called.  Nowhere is there a definition of the semantics of a
complete program WITHOUT a main() function - implementations are on their own
here.  The VMS implementation tries to emulate what most Unix implementations
seem to do.  It comes pretty close.
    
    	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
    			jsdy@hadron.COM (not yet domainised)


							-- Jerry
-------