g-rh@cca.UUCP (Richard Harter) (07/05/85)
This is a follow up to the sorting question. To keep things simple, suppose we have N=2**n keys with the values 0 to N-1. We wish to create a sorted array of N pointers to the N records. Assume that the array is being built in virtual memory on a standard timesharing system.

The obvious way to do this is to make one pass through the data, looking at the keys and filling in the pointers in the array. Surprisingly enough, this obvious algorithm can be highly inefficient. This happens when the array being built is larger than the number of resident pages allocated to the sort process, because a high percentage of the placements then generate page faults. (One of Knuth's problems is about this issue.)

Here is an analysis. Consider a bin sort. The usual scheme is to use some number of bins, say B=2**b, and sort the data in a number of passes. In each pass every key is examined once and the corresponding record (or pointer) is placed in one of the bins. Let A=2**a be the number of resident pages and M=2**m the number of records per page. (I'm making everything a power of two to keep things simple.) A placement generates a page fault whenever a page used by a bin fills up and the process no longer has any free resident pages. The total number of pages that must be used during a pass is N/M, and a page fault is generated for each of them. (Initial filling of the resident set may be ignored.) This cost is unavoidable. Placement in bins may be followed by a merge process, depending on the particular algorithm chosen. For an efficient sort the number of passes will be logB(N), so the total number of page faults is

	(N/M)*logB(N) = (N/M)*(n/b)

Call this quantity the bin filling cost. It is evident that this cost is cut by increasing the number of bins.
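As a quick sanity check on the arithmetic, here is a small Python sketch of the bin filling cost formula (the function name and the example parameters are my own, chosen only for illustration):

```python
# Bin filling cost from the analysis above: a bin sort over N = 2**n
# records, with M = 2**m records per page and B = 2**b bins, takes n/b
# passes and incurs N/M page faults per pass.

def bin_filling_cost(n, m, b):
    """Page faults spent filling pages: (N/M) * (n/b)."""
    N = 2 ** n
    M = 2 ** m
    return (N // M) * (n / b)

# Example: a million records (n = 20), 16 records per page (m = 4).
# Doubling the number of bin bits b halves the cost.
print(bin_filling_cost(20, 4, 5))   # 4 passes over 65536 pages
print(bin_filling_cost(20, 4, 10))  # 2 passes over 65536 pages
```

This only restates the formula, of course; the interesting part is what happens when b is pushed past a, below.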
If, however, the number of bins is greater than the number of resident pages, an additional cost is incurred because there is a chance that a needed bin's page will not be resident when we want to put a record in it. If the keys are distributed randomly this happens with probability (B-A)/B, and the cost is incurred for each record. Call this the bin retrieval cost. It is N*(B-A)/B per pass, so the total number of bin retrieval page faults is

	N*((B-A)/B)*(n/b) ~= N*n/b

It is clear that the best choice is B=A if the bin retrieval cost is to be avoided; in that case the number of page faults is (N/M)*(n/a). If B>A the total number of page faults is

	N*n*[1/M + 1 - 1/2**(b-a)]/b ~= N*n/b

The latter expression is smaller when 1/b < 1/(M*a), that is, when b > M*a. However, we must have b < n, which implies N > 2**(M*a).

The simple "put everything in the right place" algorithm of the first paragraph is equivalent to setting B = N/M. It is superior to a binary comparison sort if either of these two cases holds:

	I   N < M*A
	II  N > M * 2**(M*a)

In practice the latter case can only occur if the records being sorted are large and the available resident memory is small. If, for example, there is only one page available and a record fills an entire page, direct placement is clearly superior. In all other cases the required value of N is enormous.

The conclusion is that the optimal sorting method (in terms of page faults) is direct placement when ***all*** of the sorted data fits in resident unpaged memory. If this is not the case, the best sort uses a separate bin for each available resident page. (But you lose big if you overestimate the number of available pages.)

Note 1: There are some glosses in this analysis. However, the final conclusion is correct. There are limits to how much work I want to do.

Note 2: In an earlier note I claimed that a math sort (i.e. estimated direct placement) is O(N). Unfortunately the page fault cost is also O(N) for all but small N. More on this anon.
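The effect is easy to reproduce in a toy paging model. The sketch below is my own illustration, not anything from the original analysis: it assumes pure LRU replacement, a single pass, uniformly shuffled keys, and parameters (n=14, m=4, a=4) picked so that the output array is far larger than the resident set. It counts page faults for direct placement (B = N/M) versus one pass of a bin sort with B = A:

```python
import random
from collections import OrderedDict

def count_faults(page_accesses, frames):
    """Count page faults under LRU replacement with `frames` resident pages."""
    resident = OrderedDict()
    faults = 0
    for page in page_accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: mark most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults

def simulate(n=14, m=4, a=4, seed=1):
    N, M, A = 2 ** n, 2 ** m, 2 ** a
    rng = random.Random(seed)
    keys = list(range(N))
    rng.shuffle(keys)

    # Direct placement: record with key k goes straight to output page k // M.
    direct = count_faults((k // M for k in keys), A)

    # One bin-sort pass with B = A bins: bin chosen by the top a key bits;
    # each bin is written sequentially, page by page.
    fill = [0] * A
    accesses = []
    for k in keys:
        j = k >> (n - a)                    # bin index from high-order bits
        accesses.append((j, fill[j] // M))  # (bin, page within bin)
        fill[j] += 1
    binned = count_faults(accesses, A)
    return direct, binned

direct, binned = simulate()
print("direct placement faults:", direct)
print("B = A bin pass faults:  ", binned)
```

In this model direct placement faults on nearly every record, while the B = A pass stays near the compulsory N/M faults (with some jostling, since LRU occasionally evicts a still-active bin page). Even after multiplying the bin pass figure by the n/a passes a full sort needs, it comes out well ahead of direct placement here, which is the case analysis above in miniature.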
Richard Harter, SMDS Inc. Running at CCA