[comp.windows.ms] Windows loader bug. Was "function reference link" bug.

mguyott@mirror.UUCP (Marc Guyott) (12/08/89)

I have received several requests for information on the "link reference"
problem I referred to in my posting asking about any experiences that other
large Windows application developers may have encountered.  Because of these
requests I am posting the information that we have collected about the problem
that we encountered.  I apologize for the format but I do not have the time
to polish this up.  If you are interested, you'll just have to wade through it
(a lot like we did (-; ).

I especially want to thank Bart Wright for allowing me to post his many
informative and insightful writings dealing with this problem.

The algorithm shown below summarizes what you need to watch out for.  For a
clearer understanding of this algorithm you should read the text listed
in the portion of this article below my .signature.

----
My next step on the loader bug was to see if we could piece together from
what Microsoft had told us just what their locking algorithm was, so we
could predict its future behavior.

I wrote a little simulation program under Unix to model lock count behaviors
on different interpretations of how it works, and I believe I found the key:
when the lock count on a library other than KERNEL goes over 32767, the bug
occurs.  KERNEL appears to be some sort of exception, and I suspect there
is a hack in MS-Windows that simply doesn't do lock counts on KERNEL.  The
library that overflows first is usually SYSTEM, one of the Windows DLLs that
we never call directly.

The algorithm I used is the most straightforward one: do a depth-first search
from the application through all libraries it links and all libraries they
link, making sure only to never enter a library that is in your current
"stack" of libraries your path goes through; as each library is encountered,
up its lock count.  I tried a number of other algorithms that make less
sense to me, and they didn't yield numbers to distinguish our working cases
from our crashing cases.
----

We originally discovered this problem while using Windows 2.03.  This same
problem is also found in Windows 2.11.  The algorithm described above has
helped us to deal effectively with this problem for the last 18 months.

                                                         Marc
----
	      "All my life I always wanted to BE somebody.
	       I see now I should have been more specific."
			     Jane Wagner
Marc Guyott					    mguyott@mirror.tmc.com
{mit-eddie, pyramid, harvard!wjh12, xait}!mirror!mguyott
Mirror Systems		 Cambridge, MA	02140	       617/661-0777


/* Written  5:04 pm  Aug 15, 1988 by bart@prism.TMC.COM */

/* ---------- "We're Out of Module References" ---------- */
In case some of you haven't noticed, we have a problem these days with
releasing new code that makes new module references (new links between
libraries).  As I understand it, the short-term plan is to not release any
new code, though development in your private work areas can continue.  If
anyone is interested, I have a set of .EXEs from the end of July that allow
a fair amount of slack in module references.

Here's a nearly final draft of a TAR that we will be sending to Microsoft:

----------------------------------------------------------------------

We believe we have exceeded an internal limit within the MS-Windows
2.03 run-time loader, and it seems to be a limit on the number of
module references.

(As used here, a "module reference" is a reference from one dynamically
linked library to one or more functions within another DLL.  That is,
if DLL #1 calls three exported functions of DLL #2, that constitutes
a single module reference.  If DLL #3 calls two of the same three exported
functions of DLL #2, that is another module reference.  More informally,
a module reference is the sort of thing that requires you to add a new DLL
to the list of DLLs a library links to in the .lnk file.)

We are writing a large Windows application, with about 23 DLLs (so far),
not counting mlibw and mwinlibc.  The total number of module references
is about 175 (including 22 links from the main application to DLLs, and
including links to mlibw and mwinlibc).

The problem we face is a serious system crash upon exiting from our
application.  We are able to use the application heavily without ill
effects.  Regardless of whether we use it a great deal or do nothing but
immediately exit after startup, the problem arises only when exiting.  Note
that the minimal case of entry and immediate exit does execute some of our
initialization code.  The symptom is that the mouse pointer will stop
tracking mouse movements within a few seconds of double clicking on the
close box.  At that point a CTRL-ALT-DELETE has no effect; the machine must
be powered down and up to restart it.

We have been developing this application for several months now, without
any serious problems.  The problem first arose after a recent modification
of two DLLs, "question" and "choose."  We have done a number of experiments
to try to isolate the problem.

The release of the question library that first caused the crash created a
new module reference to the choose library.  However, if we simply comment
out the question library's single instance of the choose function call, the
crash does not occur.  Moreover, when the line of code is included, we can
surround it with message box calls which never display, indicating that it
is never executed.  The fact that the inclusion of a line of code that is
never executed can be the critical factor in causing a crash suggests a
loader or link problem and not an ordinary bug in our code.

We have since found that adding a module reference between two completely
different DLLs causes exactly the same problem:

After this problematic question release was withdrawn from our working
environment, a new version of our "streams" library was released that
caused a new module reference to the existing "qnav" library.  The same
crash occurred.  When the streams library's single function call to qnav
was removed, the crash did not occur.  As before, this line of code was not
executed in the simple sequence of starting the application and then
immediately exiting.

We also found that if we removed the preexisting module reference from
streams to a different library (question), the new code would then work
with the call to qnav in it.  That is, we were able to substitute the new
module reference for an old one without ill effect.

After withdrawing the problematic new streams library, we simply tried
adding three new module references to yet another library (phman), and
found that the same crash occurred.

We emphasize that in all these cases, the new release did not contain
any new libraries but rather new links from one library to another
existing library.  Moreover, adding a new function export to a library
is not necessary to create the problem.

It appears that some global run-time resource is being used up.  The next
experiment was to see what was being used up.  The streams library calls
several different exported functions of the phman library.  We added the
streams library's offending call to the qnav library, and removed the single
call from streams to the PhCountRecord function (a phman function).  This
decreased by one the number of cross-library function references, but not
the number of module references, since the other calls to the phman library
remained.  The system still crashed, presumably indicating that the
critical resource is not cross-library function references, but the number
of module references.

Another interesting note is that we could link the streams library under
the Windows 1.03 linker, and the exact same pair of results was observed:
it worked with the call to qnav commented out, and it crashed when the
function call was part of the code.

The output of the EXEHDR program appears reasonable both for libraries that
cause the crash and those that don't, to the extent we can understand it.
The version that included the extra function has these differences:  (1) a
different checksum, (2) increased offsets for entry points above the
function call (reflecting the code generated by the function call), (3) the
presence of an "imp" entry for the function call, and (4) table entries in
a different order.  Only this last is not clearly a result of the extra
function call.


We hypothesize that some table within the Windows system has a fixed
length.  Our extra module references are causing data to be written past
the end of the table, corrupting code or data used by Windows' exit code.
----------------------------------------------------------------------

/* End of text */

/* Written  3:44 pm  Oct 21, 1988 by bart@prism.TMC.COM */

We received a response from Microsoft today indicating that the problem is
caused by the lock count that is kept when DLLs are being loaded.  Windows
apparently locks a DLL whenever the DLL is referenced in any way, either
directly or indirectly.  We are exceeding a 32K limit on the number of
locks.

This is a bug in Windows but it will not be fixed soon.  Their advice
is to combine DLLs and generally try to reduce the number, and simplify
the pattern, of connections among DLLs.

I'll be looking into this further, and I'll try to come up with a
proposal for how we should combine libraries.

/* End of text */

/* Written  5:42 pm  Oct 26, 1988 by bart@prism.TMC.COM */

When we asked Microsoft for some further clarification on their response
to the loader bug, they did explain that the locking that is going on
here is not locking moveable objects, but a totally separate kind of
locking used to determine what DLLs need to be kept around as different
applications come and go.  They confirmed that the critical element is
links between DLLs.  However, they refused to provide any further
clarification on how they figure lock counts, stating that this was
"internal information."

Given this reply, the response we did get becomes a religious text that we
can pore over (like Talmudic scholars), though we may not need to (read my
next response).  Here it is, for the record:

======================================================================
======================================================================

8.R05        Created:  10-21-1988   Service Request ID: SRG952483473438
Associated attachment: 

----
The following is the response from the Windows developers explaining
the problem with many DLLs in Windows.  The problem is that the lock
count value for each DLL is stored in a WORD and when the lock count
goes over 32k, Windows crashes when it tries to unlock the DLLs when
the app is exited.  As is described below, the ONLY solution is to
decrease the number of DLLs that call other DLLs.

> Subject: problem with multiple dll links
> Date: Thu Oct 20 14:59:30 1988
>
> The problem with more than 23 DLLs linking to each other
> has been found.  What it amounts to is lock count overflow on DLL
> segments caused by the locking algorithm whenever a DLL (or app
> referencing a DLL) is loaded.  You'd think that it would only be
> incremented once each time another DLL or app which uses it
> (directly or indirectly) is loaded.  Instead, each direct
> reference causes an increment, and each indirect reference
> increments it, something like a depth-first search on a tree (except
> that the tree can have multiple cross references in the search
> (maybe it's a vine?)).  Only circular references are prevented from
> causing multiple increments.  You can guess that this causes a
> count overflow fairly rapidly.
>
> Of course the other side of the coin is that there is no work
> around other than combining your DLLs.  However, grouping your
> DLLs much the same way that you would group source code into
> segments will have an effect.  For instance, have 7 DLLs which make
> calls among themselves as much as they want, but allow only 1 of
> those 7 to call a DLL outside the group.  Yes, this bug will be fixed in
> the next version of Windows.

======================================================================
======================================================================

/* End of text */

/* Written  5:42 pm  Oct 26, 1988 by bart@prism.TMC.COM */

My next step on the loader bug was to see if we could piece together from
what Microsoft had told us just what their locking algorithm was, so we
could predict its future behavior.

I wrote a little simulation program under Unix to model lock count behaviors
on different interpretations of how it works, and I believe I found the key:
when the lock count on a library other than KERNEL goes over 32767, the bug
occurs.  KERNEL appears to be some sort of exception, and I suspect there
is a hack in MS-Windows that simply doesn't do lock counts on KERNEL.  The
library that overflows first is usually SYSTEM, one of the Windows DLLs that
we never call directly.

The algorithm I used is the most straightforward one: do a depth-first search
from the application through all libraries it links and all libraries they
link, making sure only to never enter a library that is in your current
"stack" of libraries your path goes through; as each library is encountered,
up its lock count.  I tried a number of other algorithms that make less
sense to me, and they didn't yield numbers to distinguish our working cases
from our crashing cases.
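
The search just described can be sketched in a few lines.  This is only an
illustration of the algorithm as worded above, not the original simulation
program and certainly not Windows' own code.  The function name lock_counts,
the link graph, and the module names APP, A, B and X are all made up for the
example, and the apparent KERNEL exception is not modeled.

```python
from collections import Counter


def lock_counts(links, root):
    """Count how often each library is reached by a depth-first walk.

    links maps a module name to the list of modules it links to.  A library
    is skipped only when it is already on the current path (which blocks
    circular references); reaching it again by another path counts again.
    """
    counts = Counter()
    path = [root]  # the "stack" of libraries the current path goes through

    def visit(module):
        for lib in links.get(module, ()):
            if lib in path:
                continue  # never re-enter a library on the current path
            counts[lib] += 1  # up its lock count on every encounter
            path.append(lib)
            visit(lib)
            path.pop()

    visit(root)
    return counts


# Hypothetical graph: APP links A and B, which link each other and X.
links = {
    "APP": ["A", "B"],
    "A": ["B", "X"],
    "B": ["A", "X"],
}
counts = lock_counts(links, "APP")

LIMIT = 32767  # the threshold inferred above
over_limit = [name for name, n in counts.items() if n > LIMIT]
print(dict(counts), over_limit)
```

Even in this tiny graph X is counted four times, once for each distinct path
from APP (APP-A-X, APP-A-B-X, APP-B-X, APP-B-A-X).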

[I would have thought Windows could have avoided this problem by keeping a
simple list of all libraries already encountered, instead of using a stack
of libraries in the current path, but perhaps I am missing a subtlety.  On
the other hand, perhaps their implementation was -- uh -- "less than
optimal."]
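
For comparison, here is a sketch of the "simple list of libraries already
encountered" alternative.  It is an assumption about what such a scheme
might look like, not anything Microsoft described.  The name
lock_counts_once and the graph (six mutually linked libraries L1..L6 that
all link X) are hypothetical.

```python
def lock_counts_once(links, root):
    """Visit each reachable library exactly once, using a global 'seen' set."""
    seen = {root}
    counts = {}
    pending = [root]
    while pending:
        module = pending.pop()
        for lib in links.get(module, ()):
            if lib not in seen:
                seen.add(lib)    # remembered for the whole walk,
                counts[lib] = 1  # so each library is counted only once
                pending.append(lib)
    return counts


# Hypothetical graph: APP links L1..L6; each Li links the other five and X.
libs = ["L%d" % i for i in range(1, 7)]
links = {"APP": list(libs)}
for lib in libs:
    links[lib] = [other for other in libs if other != lib] + ["X"]

once = lock_counts_once(links, "APP")
print(once)
```

Under this scheme no count can exceed 1 per load, however densely the
libraries link to one another.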

/* End of text */

/* Written  5:43 pm  Oct 26, 1988 by bart@prism.TMC.COM */

So now that we know what is going on, what can we do about it?

From working with the Windows algorithm on paper, I now know that the
really costly connections are those where a bunch of libraries call each
other.  Consider these two cases:  if the application calls 6 libraries,
each of which calls all the others and in addition calls some other library
X, X's lock counter will be 6 + (6x5) + (6x5x4) + (6x5x4x3) + (6x5x4x3x2) +
(6x5x4x3x2x1), or 1956.  If on the other hand the application calls 10
libraries, each of which calls 10 other libraries, each of which calls X,
then the lock count on X will be only 100.  The first case is not too far
off the mark from what happens within our main libraries, whereas the
second is the situation from our main libraries down to the tools
libraries.
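
The arithmetic in these two cases can be checked directly.  The sketch below
assumes the path-counting reading used above: X is counted once per distinct
path from the application, and a path through k of the n mutually linked
libraries can be chosen in n!/(n-k)! ways.  The helper name
clique_lock_count is made up for the example.

```python
from math import perm  # Python 3.8 or later


def clique_lock_count(n):
    """Lock count on X when the app links n libraries that all link each
    other and X: one increment per ordered choice of 1..n distinct libraries."""
    return sum(perm(n, k) for k in range(1, n + 1))


terms = [perm(6, k) for k in range(1, 7)]  # 6, 6x5, 6x5x4, ...
first_case = clique_lock_count(6)

# Second case: 10 libraries, each linking 10 others, each of which links X.
second_case = 10 * 10

# Smallest fully interlinked group whose count on X would pass 32767.
n = 1
while clique_lock_count(n) <= 32767:
    n += 1

print(terms, first_case, second_case, n)
```

With these assumptions the first case gives 1956 and the second 100, as
stated, and the count on X first passes 32767 when 8 libraries all link
each other.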

/* End of text  */