g-rh@cca.UUCP (Richard Harter) (07/05/85)
This is a follow up to the sorting question. To keep things simple, suppose we have N=2**n keys with the values 0 to N-1. We wish to create a sorted array of N pointers to the N records. Assume that the array is being built in virtual memory on a standard timesharing system.

The obvious way to do this is to make one pass through the data, looking at the keys and filling in the pointers in the array. Surprisingly enough, this obvious algorithm can be highly inefficient. This happens when the array being built is larger than the number of resident pages allocated to the sort process, because a high percentage of the placements then generate page faults. (One of Knuth's problems is about this issue.)

Here is an analysis. Consider a bin sort. The usual scheme is to use some number of bins, say B=2**b, and sort the data in a number of passes. In each pass every key is examined once and the corresponding record (or pointer) is placed in one of the bins. Let A=2**a be the number of resident pages and M=2**m the number of records per page. (I'm making everything a power of two to keep things simple.) A placement generates a page fault whenever a page used by a bin fills up and the process no longer has any free resident pages. The total number of pages that must be used during a pass is N/M, and a page fault is generated for each of them. (Initial filling of the resident set may be ignored.) This cost is unavoidable. Placement in bins may be followed by a merge process, depending on the particular algorithm chosen. For an efficient sort the number of passes will be logB(N), so the total number of page faults is

	(N/M)*logB(N) = (N/M)*(n/b)

Call this quantity the bin filling cost. It is evident that this cost is cut by increasing the number of bins.
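As a quick sanity check on the arithmetic, here is a small Python sketch of the bin filling cost formula (the function name and the example parameters are my own, chosen only for illustration):

```python
# Bin filling cost from the analysis above: a bin sort over N = 2**n
# records, with M = 2**m records per page and B = 2**b bins, takes n/b
# passes and incurs N/M page faults per pass.

def bin_filling_cost(n, m, b):
    """Page faults spent filling pages: (N/M) * (n/b)."""
    N = 2 ** n
    M = 2 ** m
    return (N // M) * (n / b)

# Example: a million records (n = 20), 16 records per page (m = 4).
# Doubling the number of bin bits b halves the cost.
print(bin_filling_cost(20, 4, 5))   # 4 passes over 65536 pages
print(bin_filling_cost(20, 4, 10))  # 2 passes over 65536 pages
```

This only restates the formula, of course; the interesting part is what happens when b is pushed past a, below.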
If, however, the number of bins is greater than the number of resident pages, an additional cost is incurred because there is a chance that a needed bin's page will not be resident when we want to put a record in it. If the keys are distributed randomly this happens with probability (B-A)/B, and the cost is incurred for each record. Call this the bin retrieval cost. It is N*(B-A)/B per pass, so the total number of bin retrieval page faults is

	N*((B-A)/B)*(n/b) ~= N*n/b

It is clear that the best choice is B=A if the bin retrieval cost is to be avoided; in that case the number of page faults is (N/M)*(n/a). If B>A the total number of page faults is

	N*n*[1/M + 1 - 1/2**(b-a)]/b ~= N*n/b

The latter expression is smaller when 1/b < 1/(M*a), that is, when b > M*a. However, we must have b < n, which implies N > 2**(M*a).

The simple "put everything in the right place" algorithm of the first paragraph is equivalent to setting B = N/M. It is superior to a binary comparison sort if either of these two cases holds:

	I   N < M*A
	II  N > M * 2**(M*a)

In practice the latter case can only occur if the records being sorted are large and the available resident memory is small. If, for example, there is only one page available and a record fills an entire page, direct placement is clearly superior. In all other cases the required value of N is enormous.

The conclusion is that the optimal sorting method (in terms of page faults) is direct placement when ***all*** of the sorted data fits in resident unpaged memory. If this is not the case, the best sort uses a separate bin for each available resident page. (But you lose big if you overestimate the number of available pages.)

Note 1: There are some glosses in this analysis. However, the final conclusion is correct. There are limits to how much work I want to do.

Note 2: In an earlier note I claimed that a math sort (i.e. estimated direct placement) is O(N). Unfortunately the page fault cost is also O(N) for all but small N. More on this anon.
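The effect is easy to reproduce in a toy paging model. The sketch below is my own illustration, not anything from the original analysis: it assumes pure LRU replacement, a single pass, uniformly shuffled keys, and parameters (n=14, m=4, a=4) picked so that the output array is far larger than the resident set. It counts page faults for direct placement (B = N/M) versus one pass of a bin sort with B = A:

```python
import random
from collections import OrderedDict

def count_faults(page_accesses, frames):
    """Count page faults under LRU replacement with `frames` resident pages."""
    resident = OrderedDict()
    faults = 0
    for page in page_accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: mark most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults

def simulate(n=14, m=4, a=4, seed=1):
    N, M, A = 2 ** n, 2 ** m, 2 ** a
    rng = random.Random(seed)
    keys = list(range(N))
    rng.shuffle(keys)

    # Direct placement: record with key k goes straight to output page k // M.
    direct = count_faults((k // M for k in keys), A)

    # One bin-sort pass with B = A bins: bin chosen by the top a key bits;
    # each bin is written sequentially, page by page.
    fill = [0] * A
    accesses = []
    for k in keys:
        j = k >> (n - a)                    # bin index from high-order bits
        accesses.append((j, fill[j] // M))  # (bin, page within bin)
        fill[j] += 1
    binned = count_faults(accesses, A)
    return direct, binned

direct, binned = simulate()
print("direct placement faults:", direct)
print("B = A bin pass faults:  ", binned)
```

In this model direct placement faults on nearly every record, while the B = A pass stays near the compulsory N/M faults (with some jostling, since LRU occasionally evicts a still-active bin page). Even after multiplying the bin pass figure by the n/a passes a full sort needs, it comes out well ahead of direct placement here, which is the case analysis above in miniature.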
Richard Harter, SMDS Inc. Running at CCA