mguyott@mirror.UUCP (Marc Guyott) (12/08/89)
I have received several requests for information on the "link reference" problem I referred to in my posting asking about any experiences that other large Windows application developers may have encountered. Because of these requests I am posting the information that we have collected about the problem that we encountered. I apologize for the format but I do not have the time to polish this up. If you are interested, you'll just have to wade through it (a lot like we did (-; ).

I especially want to thank Bart Wright for allowing me to post his many informative and insightful writings dealing with this problem.

The algorithm shown below summarizes what you need to watch out for. For a clearer understanding of this algorithm you should read the text listed in the portion of this article below my .signature.

----

My next step on the loader bug was to see if we could piece together from what Microsoft had told us just what their locking algorithm was, so we could predict its future behavior. I wrote a little simulation program under Unix to model lock count behaviors on different interpretations of how it works, and I believe I found the key: when the lock count on a library other than KERNEL goes over 32767, the bug occurs. KERNEL appears to be some sort of exception, and I suspect there is a hack in MS-Windows that simply doesn't do lock counts on KERNEL. The library that overflows first is usually SYSTEM, one of the Windows DLLs that we never call directly.

The algorithm I used is the most straightforward one: do a depth-first search from the application through all libraries it links and all libraries they link, making sure only to never enter a library that is in your current "stack" of libraries your path goes through; as each library is encountered, up its lock count. I tried a number of other algorithms that make less sense to me, and they didn't yield numbers to distinguish our working cases from our crashing cases.
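
[A minimal sketch of the counting rule described above, written in Python. It is not the original Unix simulation, which is not reproduced in these posts. The names lock_counts and overflowed, the module names, and the example graph are made up for illustration, and the real loader's bookkeeping may well differ.]

```python
from collections import Counter

LOCK_LIMIT = 32767  # threshold reported in the text above


def lock_counts(graph, root):
    """Count how often each module is reached by a depth-first walk from root.

    graph maps a module name to the list of modules it links to.
    A module is skipped only if it is already on the current path,
    so the same module can be counted many times via different paths.
    """
    counts = Counter()

    def visit(module, path):
        for dep in graph.get(module, ()):
            if dep in path:          # circular reference: do not re-enter
                continue
            counts[dep] += 1         # "up its lock count"
            visit(dep, path | {dep})

    visit(root, frozenset([root]))
    return counts


def overflowed(counts, limit=LOCK_LIMIT):
    """Modules (other than KERNEL, assumed exempt) whose count exceeds limit."""
    return sorted(m for m, c in counts.items() if m != "KERNEL" and c > limit)


# Hypothetical example: three libraries that all call each other and X.
example = {
    "APP": ["A", "B", "C"],
    "A": ["B", "C", "X"],
    "B": ["A", "C", "X"],
    "C": ["A", "B", "X"],
    "X": [],
}
counts = lock_counts(example, "APP")
print(counts["X"], overflowed(counts))
```

In this toy graph X ends up with a count of 15 (3 + 3x2 + 3x2x1), even though only three libraries link to it directly, and nothing is over the limit yet.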
----

We originally discovered this problem while using Windows 2.03. This same problem is also found in Windows 2.11. The algorithm described above has helped us to deal effectively with this problem for the last 18 months.

Marc

----
"All my life I always wanted to BE somebody. I see now I should have been more specific." Jane Wagner

Marc Guyott mguyott@mirror.tmc.com
{mit-eddie, pyramid, harvard!wjh12, xait}!mirror!mguyott
Mirror Systems Cambridge, MA 02140 617/661-0777

/* Written 5:04 pm Aug 15, 1988 by bart@prism.TMC.COM */
/* ---------- "We're Out of Module References" ---------- */

In case some of you haven't noticed, we have a problem these days with releasing new code that makes new module references (new links between libraries). As I understand it, the short-term plan is to not release any new code, though development in your private work areas can continue. If anyone is interested, I have a set of .EXEs from the end of July that allow a fair amount of slack in module references.

Here's a nearly final draft of a TAR that we will be sending to Microsoft:

----------------------------------------------------------------------

We believe we have exceeded an internal limit within the MS-Windows 2.03 run-time loader, and it seems to be a limit on the number of module references. (As used here, a "module reference" is a reference from one dynamically linked library to one or more functions within another DLL. That is, if DLL #1 calls three exported functions of DLL #2, that constitutes a single module reference. If DLL #3 calls two of the same three exported functions of DLL #2, that is another module reference. More informally, a module reference is the sort of thing that requires you to add a new DLL to the list of DLLs a library links to in the .lnk file.)

We are writing a large Windows application, with about 23 DLLs (so far), not counting mlibw and mwinlibc.
The total number of module references is about 175 (including 22 links from the main application to DLLs, and including links to mlibw and mwinlibc).

The problem we face is a serious system crash upon exiting from our application. We are able to use the application heavily without ill effects. Regardless of whether we use it a great deal or do nothing but immediately exit after startup, the problem arises only when exiting. Note that the minimal case of entry and immediate exit does execute some of our initialization code. The symptom is that the mouse pointer will stop tracking mouse movements within a few seconds of double clicking on the close box. At that point a CTRL-ALT-DELETE has no effect; the machine must be powered down and up to restart it.

We have been developing this application for several months now, without any serious problems. The problem first arose after a recent modification of two DLLs, "question" and "choose." We have done a number of experiments to try to isolate the problem. The release of the question library that first caused the crash created a new module reference to the choose library. However, if we simply comment out the question library's single instance of the choose function call, the crash does not occur. Moreover, when the line of code is included, we can surround it with message box calls which never display, indicating that it is never executed. The fact that the inclusion of a line of code that is never executed can be the critical factor in causing a crash suggests a loader or link problem and not an ordinary bug in our code.

We have since found that the addition of a module reference to two completely different DLLs causes exactly the same problem: After this problematic question release was withdrawn from our working environment, a new version of our "streams" library was released that caused a new module reference to the existing "qnav" library. The same crash occurred.
When the streams library's single function call to qnav was removed, the crash did not occur. As before, this line of code was not executed in the simple sequence of starting the application and then immediately exiting. We also found that if we removed the preexisting module reference from streams to a different library (question), the new code would then work with the call to qnav in it. That is, we were able to substitute the new module reference for an old one without ill effect.

After withdrawing the problematic new streams library, we simply tried adding three new module references to yet another library (phman), and found that the same crash occurred. We emphasize that in all these cases, the new release did not contain any new libraries but rather new links from one library to another existing library. Moreover, adding a new function export to a library is not necessary to create the problem. It appears that some global run-time resource is being used up.

The next experiment was to see what was being used up. The streams library calls several different exported functions of the phman library. We added the streams library's offending call to the qnav library, and removed the single call from streams to the PhCountRecord function (a phman function). This decreased by one the number of cross-library function references, but not the number of module references, since the other calls to the phman library remained. The system still crashed, presumably indicating that the critical resource is not cross-library function references, but the number of module references.

Another interesting note is that we could link the streams library under the Windows 1.03 linker, and the exact same pair of results was observed: it worked with the call to qnav commented out, and it crashed when the function call was part of the code.

The output of the EXEHDR program appears reasonable both for libraries that cause the crash and those that don't, to the extent we can understand it.
The version that included the extra function has these differences: (1) a different checksum, (2) increased offsets for entry points above the function call (reflecting the code generated by the function call), (3) the presence of an "imp" entry for the function call, and (4) table entries in a different order. Only this last is not clearly a result of the extra function call.

We hypothesize that some table within the Windows system has a fixed length. Our extra module references are causing data to be written past the end of the table, corrupting code or data used by Windows' exit code.

----------------------------------------------------------------------
/* End of text */

/* Written 3:44 pm Oct 21, 1988 by bart@prism.TMC.COM */

We received a response from Microsoft today indicating that the problem is caused by the lock count that is kept when DLLs are being loaded. Windows apparently locks a DLL whenever the DLL is referenced in any way, either directly or indirectly. We are exceeding a 32K limit on the number of locks. This is a bug in Windows but it will not be fixed soon. Their advice is to combine DLLs and generally try to simplify the number and pattern of connections among DLLs.

I'll be looking into this further, and I'll try to come up with a proposal for how we should combine libraries.

/* End of text */

/* Written 5:42 pm Oct 26, 1988 by bart@prism.TMC.COM */

When we asked Microsoft for some further clarification on their response to the loader bug, they did explain that the locking that is going on here is not locking moveable objects, but a totally separate kind of locking used to determine what DLLs need to be kept around as different applications come and go. They confirmed that the critical element is links between DLLs. However, they refused to provide any further clarification on how they figure lock counts, stating that this was "internal information."
Given this reply, the response we did get becomes a religious text that we can pore over (like Talmudic scholars), though we may not need to (read my next response). Here it is, for the record:

======================================================================
======================================================================
8.R05
Created: 10-21-1988
Service Request ID: SRG952483473438
Associated attachment:
----
The following is the response from the Windows developers explaining the problem with many DLLs in Windows. The problem is that the lock count value for each DLL is stored in a WORD and when the lock count goes over 32k, Windows crashes when it tries to unlock the DLLs when the app is exited. As is described below, the ONLY solution is to decrease the number of DLLs that call other DLLs.

> Subject: problem with multiple dll links
> Date: Thu Oct 20 14:59:30 1988
>
> The problem with more than 23 DLLs linking to each other
> has been found. What it amounts to is lock count overflow on DLL
> segments caused by the locking algorithm whenever a DLL (or app
> referencing a DLL) is loaded. You'd think that it would only be
> incremented once each time another DLL or app which uses it
> (directly or indirectly) is loaded. Instead, each direct
> reference causes an increment, and each indirect reference
> increments it, something like a depth 1st search on a tree (except
> that the tree can have multiple cross references in the search
> (maybe it's a vine?)). Only circular references are prevented from
> causing multiple increments. You can guess that this causes a
> count overflow fairly rapidly.
>
> Of course the other side of the coin is that there is no work
> around other than combining your DLLs. However, grouping your
> DLLs much the same way that you would group source code into
> segments will have an effect. For instance, have 7 DLLs which make
> calls among themselves as much as they want, but allow only 1 of
> those 7 to call a DLL outside the group. Yes, this bug will be fixed in
> the next version of Windows.
======================================================================
======================================================================
/* End of text */

/* Written 5:42 pm Oct 26, 1988 by bart@prism.TMC.COM */

My next step on the loader bug was to see if we could piece together from what Microsoft had told us just what their locking algorithm was, so we could predict its future behavior. I wrote a little simulation program under Unix to model lock count behaviors on different interpretations of how it works, and I believe I found the key: when the lock count on a library other than KERNEL goes over 32767, the bug occurs. KERNEL appears to be some sort of exception, and I suspect there is a hack in MS-Windows that simply doesn't do lock counts on KERNEL. The library that overflows first is usually SYSTEM, one of the Windows DLLs that we never call directly.

The algorithm I used is the most straightforward one: do a depth-first search from the application through all libraries it links and all libraries they link, making sure only to never enter a library that is in your current "stack" of libraries your path goes through; as each library is encountered, up its lock count. I tried a number of other algorithms that make less sense to me, and they didn't yield numbers to distinguish our working cases from our crashing cases.

[I would have thought Windows could have avoided this problem by keeping a simple list of all libraries already encountered, instead of using a stack of libraries in the current path, but perhaps I am missing a subtlety. On the other hand, perhaps their implementation was -- uh -- "less than optimal."]

/* End of text */

/* Written 5:43 pm Oct 26, 1988 by bart@prism.TMC.COM */

So now that we know what is going on, what can we do about it?
From working with the Windows algorithm on paper, I now know that the really costly connections are those where a bunch of libraries call each other. Consider these two cases: if the application calls 6 libraries, each of which calls all the others and in addition calls some other library X, X's lock counter will be

    6 + (6x5) + (6x5x4) + (6x5x4x3) + (6x5x4x3x2) + (6x5x4x3x2x1), or 1956.

If on the other hand the application calls 10 libraries, each of which calls 10 other libraries, each of which calls X, then the lock count on X will be only 100. The first case is not too far off the mark from what happens within our main libraries, whereas the second is the situation from our main libraries down to the tools libraries.

/* End of text */
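
[The two cases above can be checked with a few lines of arithmetic. This is only a sketch, assuming that the count on X equals the number of distinct non-repeating call paths from the application down to X, as in the depth-first interpretation. The function names are hypothetical and Python is used purely for illustration.]

```python
from math import perm

LOCK_LIMIT = 32767  # threshold reported in the earlier note


def clique_lock_count(n):
    """Count on X when the app links n libraries that all call each other and X.

    A path to X passes through k distinct libraries in some order (k = 1..n),
    and there are n!/(n-k)! such orderings for each k.
    """
    return sum(perm(n, k) for k in range(1, n + 1))


def layered_lock_count(*layer_widths):
    """Count on X when each layer only calls down into the next layer."""
    total = 1
    for width in layer_widths:
        total *= width
    return total


def smallest_overflowing_clique(limit=LOCK_LIMIT):
    """Smallest n for which the mutually calling case exceeds limit."""
    n = 1
    while clique_lock_count(n) <= limit:
        n += 1
    return n


print(clique_lock_count(6))           # first case: 6 libraries calling each other
print(layered_lock_count(10, 10))     # second case: 10 libraries over 10 libraries
print(smallest_overflowing_clique())  # clique size that first passes the limit
```

Under that assumption the first case gives 1956 and the second gives 100, matching the figures above. Seven mutually calling libraries would put X at 13699, and eight would put it at 109600, well past 32767.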