[comp.sys.sun] Shared libraries linker error

hans@uunet.uu.net (Hans Buurman) (02/08/89)

In article <615@dutrun.UUCP> I write:
	(program running under 3.5 did not run under 4.0 due to function
	select() in both the program and libsunview.a)

>Could somebody out there
>a) tell me which rules the linker uses ?
>b) guess where Sun screwed up ?

As several people have pointed out, this was not unreasonable behaviour
on the part of the linker. My apologies to Sun for suggesting otherwise.
I am left wondering why the program used to work in the first place.....

	Hans

Disclaimer: any opinions above are my own.

Hans Buurman                   | hans@duttnph.UUCP
Pattern Recognition Group      | mcvax!hp4nl!dutrun!duttnph!hans
Faculty of Applied Physics     | tel. 31 - (0) 15 - 78 46 94
Delft University of Technology |

mac@mrk.ardent.com (Michael McNamara) (02/08/89)

In article <615@dutrun.UUCP> mcvax!duttnph!hans@uunet.uu.net (Hans Buurman) writes:
>One user on a Sun 3/60 that has recently been upgraded to SunOS 4.0 was
>complaining about a function dumping core that should not have been used
>at all. It turned out that he had a function called select() in his
>program....
>Could somebody out there
>a) tell me which rules the linker uses ?

There is this neat facility on UNIX. :-) It's called man. :-) If you want
to know how something works, you type man. :-) Try man ld. :-)

The program works as coded.  From your description, it sounds like his
compile command was

cc -o myprog myprog1.o myprog2.o -lsunwindow

This is automatically expanded to:

cc -o myprog myprog1.o myprog2.o -lsunwindow -lc

One of the linker's jobs is to link up references to external routines.
Its rules are well defined. Look in the manual for ld.  This behaviour is
unchanged since early unix days:
...
     If a named file is a library, it is searched exactly once at
     the  point  it  is  encountered  in the argument list.  Only
     those routines defining an unresolved external reference are loaded...
	                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You've already supplied select(), hence your version is used.

As your fictitious programmer supplied a routine called select(), and linked
it ahead of the libraries, his select() would be used instead of any
select defined in any library.  Although he didn't call select, one of the
library routines did, and got his instead of the one in libc.a.

This is all as it should be, and allows one to quietly overload routine
names in any library.  Note that if select() was multiply defined in
object files, a warning would be posted.
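
The archive rule quoted above can be modelled in a few lines of C. This is
only a toy sketch, not ld itself: the structure, the helper names, and the
symbol panel_init are hypothetical, and each object is assumed to define one
symbol and reference at most one. The archive scan repeats until nothing
more is pulled in, which is an assumption about how an indexed archive is
searched.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 16

/* One archive member in the toy model (names are illustrative). */
struct member {
    const char *name;     /* e.g. "ndet_select.o" */
    const char *defines;  /* the one symbol this member defines */
    const char *needs;    /* one symbol it references, or NULL */
    int loaded;           /* set to 1 once the linker pulls it in */
};

static const char *defined[MAX_SYMS];
static const char *referenced[MAX_SYMS];
static int ndefined, nreferenced;

static int in_set(const char **set, int n, const char *sym)
{
    for (int i = 0; i < n; i++)
        if (strcmp(set[i], sym) == 0)
            return 1;
    return 0;
}

static void add(const char **set, int *n, const char *sym)
{
    if (sym != NULL && !in_set(set, *n, sym) && *n < MAX_SYMS)
        set[(*n)++] = sym;
}

/* A plain object file on the command line is always loaded. */
static void link_object(const char *defines, const char *needs)
{
    add(defined, &ndefined, defines);
    add(referenced, &nreferenced, needs);
}

/* A member is loaded only if it defines a symbol that is referenced
 * but not yet defined at the time the archive is searched. */
static void search_archive(struct member *lib, int n)
{
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++) {
            if (lib[i].loaded)
                continue;
            if (in_set(referenced, nreferenced, lib[i].defines) &&
                !in_set(defined, ndefined, lib[i].defines)) {
                lib[i].loaded = 1;
                add(defined, &ndefined, lib[i].defines);
                add(referenced, &nreferenced, lib[i].needs);
                progress = 1;
            }
        }
    }
}

/* Returns 1 if the library's own select() member ends up in the link. */
int ndet_select_loaded(int user_defines_select)
{
    struct member lib[] = {
        { "panel.o",       "panel_init", "select", 0 },
        { "ndet_select.o", "select",     NULL,     0 },
    };

    ndefined = nreferenced = 0;
    link_object("main", "panel_init");   /* the user's program */
    if (user_defines_select)
        link_object("select", NULL);     /* the user's own select() */
    search_archive(lib, 2);
    return lib[1].loaded;
}
```

In this model ndet_select_loaded(1) reports that the library member is
skipped, so the library's call to select() binds to the user's routine;
ndet_select_loaded(0) reports that the library member is loaded.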

Now the ability to quietly override library routines is of somewhat
questionable utility, and violates the principle of least surprise,
although I have used this feature on occasion.

I used it once to attach statistics collecting to fopen/fclose.  I
supplied my own fopen/fclose, which jot down statistics, then call
openf/closef.

Then I extracted the system's versions of fopen/fclose via "ar x libc.a"
and used emacs to change the names of the procedure definitions in the .o
from fopen/fclose to openf/closef. (being careful not to change calls to
fopen/fclose lurking in the library).

Then when I link, everything calls my fopen, which calls openf, from my
doctored .o's extracted from libc.a and the extra fopen definition (from
libc.a) is not used.
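
The shape of such a wrapper can be sketched as follows. This is an
illustration under stated assumptions, not the original code: the wrappers
are given the hypothetical names stats_fopen/stats_fclose so they do not
clash with the C library, and openf/closef here are stand-ins that simply
forward to the real stdio routines, where the setup described above used
the renamed objects extracted from libc.a.

```c
#include <stdio.h>

/* Number of open and close attempts seen so far. */
static unsigned long open_calls, close_calls;

/* Stand-ins for the renamed libc entry points. In the setup described
 * in the text these were the doctored .o files; here they forward to
 * the real stdio functions so the sketch links anywhere. */
static FILE *openf(const char *path, const char *mode)
{
    return fopen(path, mode);
}

static int closef(FILE *fp)
{
    return fclose(fp);
}

/* Wrapper: count the attempt, then hand off to the original routine. */
FILE *stats_fopen(const char *path, const char *mode)
{
    open_calls++;
    return openf(path, mode);
}

/* Wrapper: count the attempt, then hand off to the original routine. */
int stats_fclose(FILE *fp)
{
    close_calls++;
    return closef(fp);
}

unsigned long stats_open_calls(void)  { return open_calls; }
unsigned long stats_close_calls(void) { return close_calls; }
```

The counters record attempts, so a failed open is still counted.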

>b) guess where Sun screwed up ?  

Sorry.  Not Sun's mistake.  Program works as coded.  

Sun (and every other Unix vendor) might want to change ld so that it issues
a warning message when a procedure supplied in a library is already
defined from some earlier object or library...

[[ That would be nice.  I have often been tempted to name a function
"wait" for whatever reason, and I'm sure that naive users have made
similar mistakes.  --wnl ]]

Michael McNamara 
  mac@ardent.com

gandalf@csli.stanford.edu (Juergen Wagner) (02/08/89)

[I am sending the reply to the entire list because I think there's been
some confusion about the new select(2) syntax.]

In article <615@dutrun.UUCP> mcvax!duttnph!hans@uunet.uu.net (Hans Buurman) writes:
>...
>One user on a Sun 3/60 that has recently been upgraded to SunOS 4.0 was
>complaining about a function dumping core that should not have been used
>at all. It turned out that he had a function called select() in his
>program.

Hmm... There is also a system call select(2) in SunOS4.0. This isn't
particularly notable but the syntax has changed from 3.x to 4.x. Unless
done deliberately, one shouldn't use names of system calls for one's own
functions.

>One of the SunView routines he called (to initialize a panel) also used a
>function called select(), which is in ndet_select.o in libsunwindow.a.  It
>looks like the compiler had linked the sunwindow calls to select() to his
>own program.

/*
 * Ndet_select.c - Notifier's version of select....

And if you look at the declaration:

extern int
select(nfds, readfds, writefds, exceptfds, timeout)
        register int nfds;
        fd_set *readfds, *writefds, *exceptfds;
        struct timeval *timeout;

What the linker did was correct. It used the libsunwindow.a version of
select because this library was mentioned first on the cc line: something
like

	cc -o foo foo.c -lsunwindow -lpixrect

Even if you used the standard version of select, the core dump should
still be there. Consult your man page for select(2) for the changed
syntax.

-- 
Juergen Wagner		   			gandalf@csli.stanford.edu
						 wagner@arisia.xerox.com

[[ On this topic, I have noticed that the old select kernel call was
retained for backward compatibility.  A program that uses "select" and
that is compiled and linked on a 3.x machine will, under most
circumstances, still work correctly under 4.x (because it's using the old
select kernel call).  But before you can compile it on a 4.x machine, you
*must* make some changes.  Read the new manual page for select(2) to see
how it must be changed.  This backward compatibility move was likely
documented, but I'm too busy (or is that "lazy") to go look it up right
now.  To be more specific:  the new select can handle file descriptor
masks longer than 4 bytes (thus it can handle file descriptors >= 32).
The old one assumed that the mask was 4 bytes.  You can use old
executables provided that you never try to "select" on a fd >= 32.
--wnl ]]

guy@uunet.uu.net (Guy Harris) (02/22/89)

>[[ On this topic, I have noticed that the old select kernel call was
>retained for backward compatibility.

No, it wasn't.  "select" in 3.x was system call number 93, and "select" in
4.0 is system call number 93.  There is no "old select kernel call" in
4.0.

>A program that uses "select" and that is compiled and linked on a 3.x
>machine will, under most circumstances, still work correctly under 4.x
>(because it's using the old select kernel call).

A program that uses "select" "properly" and that is compiled and linked on
a 3.x machine will, assuming no other binary compatibility problems occur,
still work correctly under 4.x under *any* circumstances (modulo bugs in
4.x).

A program that uses "select" "improperly" - i.e., passes in pointers to
"int"s rather than "fd_set"s, but passes the result of "getdtablesize()"
as the first argument under the assumption that it will always return a
number less than or equal to the number of bits in an "int" - is quite
likely to fail under 4.x. 

Replace "3.x" with "4.2BSD", and "4.x" with "4.3BSD", and the above
statements are pretty much correct; the change to "select" from 3.x to 4.0
is a change from the 4.2BSD version to the 4.3BSD version.  I don't know
whether the 4.2BSD or SunOS 3.x documentation made it clear that it was a
bad idea to use "getdtablesize" - or, at least, to use it without cutting
its value off at 32 - because the system might be changed to support more
than (# of bits per "int") file descriptors, or not; it may well have
*encouraged* the use of "getdtablesize()", which is unfortunate. 
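
For comparison, a call in the "proper" style might look like the sketch
below. It assumes a present-day POSIX system (the <sys/select.h> header and
the FD_* macros), and wait_readable is a hypothetical helper name. The
first argument is the highest descriptor plus one rather than the result of
"getdtablesize()", and the mask is an "fd_set" rather than an "int".

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if fd becomes readable within timeout_ms (assumed to be
 * non-negative), 0 on timeout, and -1 on error or if fd cannot be
 * represented in an fd_set. */
int wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    /* Build the mask with the macros instead of shifting bits in an int. */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Width is the highest descriptor of interest plus one. */
    int n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

Called on the read end of an empty pipe with a zero timeout, this returns 0;
once a byte has been written to the other end it returns 1.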

>But before you can compile it on a 4.x machine, you *must* make some
>changes.

Only if you've been using "select" "improperly", as indicated.

[[ I realized that there didn't need to be two separate kernel calls just
the other day while tracking down yet another select related bug.  When I
saw the described behavior (a 3.x executable that uses select still
working under 4.0) I assumed that they had just retained the old call.
After delving into things deeper, I know *exactly* why things work the way
they do.  An fd_set is an array of longs holding the bitmask.  The first
long corresponds to the *lowest* numbered file descriptors.  Therefore, if
you never use a "width" greater than 31, your program stands a very good
chance of working under 3.x (also 4.2BSD) as well as 4.x (also 4.3BSD),
because the kernel will never need anything beyond the first long.  Neat,
huh?  As for using select "properly" under 3.x, please tell me where in
the 3.x documentation it is stated that one must use a fd_set pointer in a
select call.  The manual page sure doesn't say anything about it.  It also
encouraged the use of "getdtablesize", unfortunately.  Although fd_set was
defined in <sys/types.h>, none of the macros associated with it were
defined anywhere (much less documented).  This all made it rather hard to
use select "properly" under 3.x.  --wnl ]]
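
The layout argument in the note above can be checked with a small model.
This is a sketch only: it uses 32-bit words to stand in for the "long" of
the machines discussed, and the type and function names are hypothetical
rather than taken from any system header.

```c
#include <stdint.h>
#include <string.h>

#define WORD_BITS   32  /* models a 32-bit long */
#define MODEL_WORDS 8   /* hypothetical size: room for 256 descriptors */

/* Model of an fd_set: an array of words, lowest descriptors first. */
struct model_fd_set {
    uint32_t bits[MODEL_WORDS];
};

/* Set the bit for fd, the way the new-style mask is described above. */
void model_set(struct model_fd_set *s, int fd)
{
    s->bits[fd / WORD_BITS] |= (uint32_t)1 << (fd % WORD_BITS);
}

/* Returns 1 if an old-style single-word mask (1 << fd) is identical to
 * the first word of the new-style set, 0 if the descriptor is out of
 * range or its bit lives beyond the first word. */
int layouts_agree(int fd)
{
    struct model_fd_set s;

    if (fd < 0 || fd >= WORD_BITS * MODEL_WORDS)
        return 0;

    memset(&s, 0, sizeof s);
    model_set(&s, fd);

    if (fd >= WORD_BITS)
        return 0;   /* a one-word mask cannot express this descriptor */
    return s.bits[0] == ((uint32_t)1 << fd);
}
```

For descriptors 0 through 31 the two layouts agree in this model; from
descriptor 32 upward the bit falls in a later word and the old-style mask
has no way to carry it.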