[net.lang.c] Math sorts and such

g-rh@cca.UUCP (Richard Harter) (04/18/86)
In article <> greg@utcsri.UUCP (Gregory Smith) writes:
.......
>
>Now for something completely different...
>From: g-rh@cca.UUCP (Richard Harter)
>>	Grumble, grumble.  Sorry, this is a real practical problem.
>>	Let me give some context.  Suppose you are doing a high speed
>>	sort of character strings.  Being a clever fellow you have
>>	read the glossed over parts of the literature and have seen
>>	that a math sort is O(n) rather than O(n log n).  You also
>>	notice that it is cheaper to compare 4 bytes at once with
>>	an 32-bit integer compare than to make 4 byte compares.
>
>I'm not sure on this - isn't it ridiculous to do a mathsort of 32-bit
>ints? In O(n) above ( for the mathsort) doesn't n become 2^32 rather
>than the list size? If so, it would of course be even sillier for
>larger ( i.e. reasonably-sized ) char strings.
>

	This is all getting a little far from C; however I will finish
this last point off.  The fundamental idea of a math sort is to calculate
the position that a record will occupy in the sorted file from the value
of the key.  Ideally one would want a function, f(key), whose value would
be the exact index of the record in the sorted file.  In practice one
settles for a simple function which is approximate which will yield an
almost sorted file.  The simplest such function is a straight line.
That is, one fits a straight line to the keys, considered as numbers.
One uses linear interpolation to get the corresponding index for each
key.  If you know where the record goes in the output file you simply
put it in the right place to begin with.  (This is why a math sort is
O(n) -- ignoring I/O considerations.)

	Actual implementation of a math sort is not that simple a
matter.  Given that we are using an appoximation to the ideal function
we have the problem that the calculated index will be close to but not
exactly the true index.  This can be handled by building a histogram of
calculated key values in a first pass.  In effect what we do is use a
large number of bins; for each key we calculate what bin the record
will be in.  From the histogram we can determine (via the cumulative
distribution) the starting point of the contents of each bin in the
final sorted file.  The rest of the work is an exercise for the reader.

	The usual 'objection' to math sorts is that the effectiveness
of the math sort depends on a 'reasonable' distribution of keys and
can be defeated if the values of the keys are distributed in a highly
non-uniform manner.  In practice this is not a problem for reasons
that are far too recondite to go into here.  The real problems are
(a) the cost of calculating the rough index value and (b) certain
I/O questions.

	First of all external sorts reduce to O(n log n) sorts (at
best) if you are doing an external sort.  Even if you are doing an
internal sort you will run into page faulting problems if n is large.
If your sorted file will occupy more pages than you allotted you
generate a page fault when you write to pages that aren't resident.
This problem arises when you have more bins than allotted pages.
(Comparison sorts do not run into this problem because they only
use two bins.)

	I could go on, but the issues all get very tricky.  However
the fundamental idea is the two stage mapping of character strings
into numbers and then into indexes.  The latter is simple math; the
former implies type punning.  However the payoff is distinct -- a
well written math sort will be faster than a quicksort by a large
margin.  [Knuth mentions math sorts in passing, but really doesn't
discuss the issues involved or the potential advantages.]  In
closing, to bring this all back to C, let me give you two lines of
code from a math sort:

.....
for (i=1;i<n;i++) h[*i1++ = (v[*i2++]-vn)/del]++;
.....
for (i=0;i<n;i++) y[c[*i1++]++] = *i2++;

The first line is the index calculation loop; the second is the
sort loop.  I shan't explain the code -- I just offer them as
amusing examples of moderately cryptic code.

		Richard Harter, SMDS Inc.