[net.math] Sorting a sorted list by a different order of keys

haahr@siemens.UUCP (Paul Haahr) (06/07/85)

Given a list sorted by keys k1, k2, is there a fast (i.e., better than
n log n) method for sorting it by keys k2, k1?  A stable sort on k2
(one that preserves order when keys are equal) and a merge where each
run is the sequence of values with equal k1 are both n log n.  Is there
anything I'm missing that will take advantage of the fact that the k1's
are already ordered?
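
(For reference, the straightforward n log n route is a single
comparison sort on the compound key (k2, k1); making k1 the tie-breaker
removes the need for stability.  A minimal sketch using the C library's
qsort, where the record type and names are invented for illustration:

        #include <stdlib.h>

        typedef struct { int k1, k2; } Item;

        /* order by k2 first, then k1 to break ties */
        static int by_k2_then_k1(const void *pa, const void *pb)
        {
            const Item *a = pa, *b = pb;

            if (a->k2 != b->k2) return (a->k2 < b->k2) ? -1 : 1;
            if (a->k1 != b->k1) return (a->k1 < b->k1) ? -1 : 1;
            return 0;
        }

        /* usage: qsort(items, n, sizeof *items, by_k2_then_k1); */
)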

			-- Paul Haahr
			..!{allegra,ihnp4}!princeton!{jendeh,siemens}!haahr

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (06/09/85)

> Given a list sorted by keys k1, k2 is there a fast (i.e., better than
> n log n) method for sorting it by keys k2, k1?  A stable sort on k2

A simple proof that no such general method exists:
	Consider the case where no two k2's match.
QED.

Actually, there are some sorting methods that are better than N log N;
for practical purposes, use the best standard sorting algorithm you have.

jss@sftri.UUCP (J.S.Schwarz) (06/09/85)

>
> Given a list sorted by keys k1, k2 is there a fast (i.e., better than
> n log n) method for sorting it by keys k2, k1?  
> 			-- Paul Haahr
> 			..!{allegra,ihnp4}!princeton!{jendeh,siemens}!haahr

Not in general.  If all key values are distinct then knowing that the
list is sorted by k1 gives no information about the order when
sorted by k2.

Jerry Schwarz
ihnp4!attunix!jss

gjerawlins@watdaisy.UUCP (Gregory J.E. Rawlins) (06/11/85)

In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
>[.....]
>Actually, there are some sorting methods that are better than N log N;
>[.....]

	Pardon me? Loose lips sink ships; please define your model.
-- 
Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins

ech@spuxll.UUCP (Ned Horvath) (06/13/85)

Example of a faster-than-nlogn sort: the keys are precisely the
integers 0..n-1, so each key tells you directly which cell of the
sorted array its record belongs in.  The following C then does the sort:

        for (i = 0; i < n; i++) {
                while (key[i] != i) {
                        /* exchange: send key[i] home to cell key[i],
                           picking up whatever was living there */
                        int j = key[i];

                        key[i] = key[j];
                        key[j] = j;
                }
        }

Note that every exchange puts (at least) one element in its final position,
so clearly the number of exchanges is upper-bounded by n-1.  Similarly
the while test can succeed at most n-1 times and fail at most n times
(total), so the whole thing is linear.

This sounds like a cheat, but indeed many 'real world' sorting problems
are of this type.  Another classical example is creating a new phone book:
I have a (very large) already sorted file -- the old phone book -- along
with a fairly short list of additions and deletions.  As a function of the
total size of the result, sorting the changes and merging is effectively
linear.
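
To make that concrete, here is a minimal sketch of the merge step,
assuming integer keys, an already-sorted additions list, and ignoring
deletions (the function and variable names are invented for
illustration):

        /* Merge a sorted master list with a sorted list of additions.
           Both inputs ascend; the output does too.  Time is linear in
           the size of the result. */
        void merge_additions(const int *master, int nm,
                             const int *adds, int na, int *out)
        {
            int i = 0, j = 0, k = 0;

            while (i < nm && j < na)    /* take the smaller head first */
                out[k++] = (master[i] <= adds[j]) ? master[i++] : adds[j++];
            while (i < nm)              /* copy whatever remains */
                out[k++] = master[i++];
            while (j < na)
                out[k++] = adds[j++];
        }

Sorting the short change list costs m log m for m changes, and the
merge is linear in the output, so for m much smaller than n the whole
job is effectively linear, as claimed.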

=Ned=

throopw@rtp47.UUCP (Wayne Throop) (06/13/85)

> In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> >Actually, there are some sorting methods that are better than N log N;
>         Pardon me? Loose lips sink ships, please define your model.
> Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
> {allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins

   [I take the above to mean that the second poster doubts that there
    are sort methods faster than NlogN, so, responding to that point:]

There are indeed sorting methods that are "faster" than NlogN.  Perhaps
the simplest is "radix sorting", an example of which is the good old
card collator (well, old anyway).  The method is order N*M, where N is
the number of records, and M is the "width" of the key.  This method
works best when keys are very narrow.  In this circumstance, you can
have a "bin" for each possible key value.  Then the sort is order N...
you simply look at the key of each record, put the record in the bin
associated with its key, and then output the contents of each bin in
turn.
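
Here is a minimal sketch of that bin idea, assuming bare one-byte keys
so that 256 bins suffice: count how many keys land in each bin, then
read the bins out in order (the names are invented for illustration):

        #include <string.h>

        #define NVALUES 256    /* one bin per possible key value */

        void bin_sort(unsigned char *key, int n)
        {
            int count[NVALUES], i, v;

            memset(count, 0, sizeof count);
            for (i = 0; i < n; i++)               /* drop keys into bins */
                count[key[i]]++;
            for (v = 0, i = 0; v < NVALUES; v++)  /* empty bins in order */
                while (count[v]-- > 0)
                    key[i++] = (unsigned char)v;
        }

With full records rather than bare keys, each bin would hold a list of
records, but the shape is the same: one pass to distribute and one
pass to collect.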

Note that this method only outperforms a traditional quicksort or
heapsort when the key is "smaller than" logN in some sense, and hence
makes sense only for VERY large sets of records or VERY narrow keys.

I hope this saves all those ships in danger of submersion.  I'll keep my
lips tight, just in case.  :-)
-- 
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!rtp47!throopw

sjs@u1100s.UUCP (Stan Switzer) (06/14/85)

In article <67@rtp47.UUCP> throopw@rtp47.UUCP (Wayne Throop) writes:
> > In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> > >Actually, there are some sorting methods that are better than N log N;
> >         Pardon me? Loose lips sink ships, please define your model.
> > Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
> 
>    [I take the above to mean that the second poster doubts that there
>     are sort methods faster than NlogN, so, responding to that point:]
> 
> There are indeed sorting methods that are "faster" than NlogN.  Perhaps
> the simplest is "radix sorting", an example of which is the good old
> card collator (well, old anyway).  The method is order N*M, where N is
> the number of records, and M is the "width" of the key.  This method
> works best when keys are very narrow.  In this circumstance, you can
> have a "bin" for each possible key value.  Then the sort is order N...
> you simply look at the key of each record, put the record in the bin
> associated with its key, and then output the contents of each bin in
> turn.

BUT, unless there is heavy duplication of keys, the width of the key is
at least log (base radix) of the number of records: with radix r, you
need r**M >= N to give N records distinct keys, so M >= log N.  So the
radix sort's N*M is at least N log N under normal circumstances.

> Note that this method only outperforms a traditional quicksort or
> heapsort when the key is "smaller than" logN in some sense, and hence
> makes sense only for VERY large sets of records or VERY narrow keys.

As I said above, this requires a heavy duplication of keys.

> I hope this saves all those ships in danger of submersion.  I'll keep my
> lips tight, just in case.  :-)
> Wayne Throop at Data General, RTP, NC
> <the-known-world>!mcnc!rti-sel!rtp47!throopw

Well, you might!

P.S.  Special cases can almost always be exploited: partial sortedness,
      updates to sorted files, sparse, dense, uniform distributions, etc.

------------------------------------------------------------------------------
Stan Switzer     |  "You know, they're growing mechanical trees / They grow to
ihnp4!u1100s!sjs |   their full height and chop themselves down" - L. Anderson

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (06/18/85)

> In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> >[.....]
> >Actually, there are some sorting methods that are better than N log N;
> >[.....]
> 
> 	Pardon me? Loose lips sink ships, please define your model.

As just an example -- not the best possible algorithm -- consider
sorting by hashing (O(N) in time and space).  Various obvious
modifications could make this approach workable.  For practical
purposes, a good O(N log N) merge sort is just fine.

gjerawlins@watdaisy.UUCP (Gregory J.E. Rawlins) (06/20/85)

>>>Actually, there are some sorting methods that are better than N log N;
>> 
>>     Pardon me? Loose lips sink ships, please define your model.
>
>As just an example -- not the best possible algorithm -- consider
>sorting by hashing (O(N) in time and space).  Various obvious
>modifications could make this approach workable.  For practical
>purposes, a good O(N log N) merge sort is just fine.

    I try not to clutter the net with things like this, but I wish
to point out that the time bound of any algorithm _depends on your model_.
That was the reason I asked for a definition of the model.  I didn't
want people to go away with the notion that nlgn is beatable in
the general case where you know nothing about your data and you
are counting the minimum number of comparisons necessary and
sufficient to transform any permutation of n data items to some
(fixed) permutation of itself.
    There are only three ways to "beat" nlgn.  Either you know
something about the distribution of the input prior to running
your algorithm, or you decide to count something other than the
number of comparisons of two data elements, or you count something
other than the worst possible case.  In all three cases you
have changed the model.  Radix sort is a simple example of the
first type of special case since it won't work unless you know
that the input consists of integers in some prespecified range.
Sorting in parallel using n processors (taking constant time)
is an example of the second.  Hash sort is an example of the
third type since you are concerned with the average case.
	Greg.
-- 
Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins

throopw@rtp47.UUCP (Wayne Throop) (06/24/85)

I agree with most of what Gregory Rawlins said in the referenced
posting, but there are two points I'd like to clarify.

>>>>Actually, there are some sorting methods that are better than N log N;
>>>     Pardon me? Loose lips sink ships, please define your model.
>>As just an example -- not the best possible algorithm -- [...etc...]

>    I try not to clutter the net with things like this, but I wish
>to point out that the time bound of any algorithm _depends on your model_.

Hear hear.

>That was the reason I asked for a definition of the model.  I didn't
>want people to go away with the notion that nlgn is beatable in
>the general case where you know nothing about your data and you
>are counting the minimum number of comparisons necessary and
>sufficient to transform any permutation of n data items to some
>(fixed) permutation of itself.

Here is point one where I disagree and want to put in my two cents
worth.  Comparison sorting is not the "general case" of sorting.  The
general case of sorting is generating an "ordered" permutation of n data
items.  Comparison sorting "knows" something about the data items, in
that it knows there is some test on any two of them that will answer
the question "is x less-than y".  I claim that this is "knowing
something" about the data.

Thus, I think that the original poster was quite reasonable when he said
"there are some sorting methods that are better than NlogN"  He
*didn't* say "there are some comparison sorting methods better than
NlogN", which is indeed false.

>    There are only three ways to "beat" nlgn. Either you know
>something about the distribution of the input prior to running
>your algorithm, or you decide to count something other than the
>number of comparisons of two data elements, or you count something
>other than the worst possible case.  In all three cases you
>have changed the model.

I agree 100%, keeping in mind that what this boils down to is "to beat
NlogN you have to use something other than a comparison sort".

>                        Radix sort is a simple example of the 
>first type of special case since it won't work unless you know
>that the input consists of integers in some prespecified range.

The bit about radix sorting is where I wish to pick nit number two.
Radix sorting will work in just about any case where comparison sorting
will work.  (In fact, I suspect that it will work in *any* case where
comparison sorting will work, but I can't prove it).

All you need is a transform on the key to map it onto a bit-string that
preserves the ordering relation of the key that a comparison sort would
use, and that makes the ordering relation a "lexical" compare of the
mapped bit-strings.

For (a trivial) example, say that what you've got are keys of 20 ascii
characters that you wish to compare lexically.  Then the order of the
simple radix sort I'll describe is 20N (where N is the number of
records).  (Note that this means that radix sorting will outperform
comparison sorting in this case when you have more than 10**6 records or
so.)

The radix sort of these fixed-length keys is simple.  You radix sort on
the right-most character.  Then on the next-to-rightmost character.  And
so on.  Each stage of radix sorting is stable, so when you have gotten
to the point of sorting on the leftmost byte, you have a complete sort.
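
A minimal sketch of that sort for fixed-width byte keys, with each
pass made stable by counting byte values first (the type and names
are invented for illustration):

        #include <string.h>

        #define KEYLEN 20
        #define RADIX  256

        typedef struct { unsigned char key[KEYLEN]; } Rec;

        /* Stable LSD radix sort: one counting pass per key byte,
           rightmost byte first.  tmp must hold n records. */
        void radix_sort(Rec *a, Rec *tmp, int n)
        {
            int pos, i, start[RADIX + 1];

            for (pos = KEYLEN - 1; pos >= 0; pos--) {
                memset(start, 0, sizeof start);
                for (i = 0; i < n; i++)           /* count byte values */
                    start[a[i].key[pos] + 1]++;
                for (i = 1; i <= RADIX; i++)      /* first slot per bin */
                    start[i] += start[i - 1];
                for (i = 0; i < n; i++)           /* stable scatter */
                    tmp[start[a[i].key[pos]]++] = a[i];
                memcpy(a, tmp, n * sizeof *a);
            }
        }

Each pass touches every record once, so the total work is KEYLEN
passes over N records: the 20N figure given earlier.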

As mentioned above, this technique is applicable given only that you can
map your keys into a lexically-comparable-bit-string form.  It turns out
that this can be done for many interesting cases, such as floating point
numbers, bcd (with leading sign) numbers, and so on (strings of ascii
characters are already so mapped).
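
As an illustration of one such mapping for floating point, here is the
usual trick, assuming IEEE-format 32-bit floats (an assumption on my
part; the function name is invented): flip every bit of a negative
number, and only the sign bit of a non-negative one.

        #include <stdint.h>
        #include <string.h>

        /* Map a float to a 32-bit key whose unsigned, lexical order
           matches the float's numeric order (NaNs aside). */
        uint32_t float_key(float f)
        {
            uint32_t u;

            memcpy(&u, &f, sizeof u);     /* reinterpret the bits */
            if (u & 0x80000000u)
                return ~u;                /* negative: flip all bits */
            return u | 0x80000000u;       /* non-negative: flip sign bit */
        }

Negative floats have sign-magnitude encodings that order backwards;
inverting all their bits both fixes the reversal and places them below
the positives.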

The analogy of radix sorting and comparison sorting is strong.  The
general comparison sort consists of an algorithm, and a comparison
function that will order the keys.  The general radix sort consists of
an algorithm and a mapping function that maps the keys into a bit-string
as described above.

>Sorting in parallel using n processors (taking constant time)
>is an example of the second. Hash sort is an example of the 
>third type since you are concerned with the average case.

Isn't the worst case of comparison sorting N-squared?  I thought that
the NlogN was only the "average" or "expected" case.

>Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
>{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins
-- 
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!rtp47!throopw

g-rh@cca.UUCP (Richard Harter) (06/25/85)

Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo,
writes:

>    There are only three ways to "beat" nlgn. Either you know
>something about the distribution of the input prior to running
>your algorithm, or you decide to count something other than the
>number of comparisons of two data elements, or you count something
>other than the worst possible case.  In all three cases you
>have changed the model. Radix sort is a simple example of the 
>first type of special case since it won't work unless you know
>that the input consists of integers in some prespecified range.
>Sorting in parallel using n processors (taking constant time)
>is an example of the second. Hash sort is an example of the 
>third type since you are concerned with the average case.

	Er, well, it is a bit more complicated than that.
Your third formulation is wrong; the basic theorem of sorting
says that even the average case requires on the order of
n log n comparisons.  However the theorem is a little misleading,
since O(n) sorts exist for rather general classes of cases.  The
basis for the theorem is that a sort must distinguish between
the n! possible permutations.  The number of binary comparisons
required to distinguish the n! cases is log2 n!, which by
Stirling's approximation is on the order of n log n.  This is
the best possible average case.

	The reason that the basic theorem is misleading is
that it is stated in terms of binary comparisons.  This involves
two assumptions.  The first is that the binary comparison
is the appropriate element of measurement.  The second is
that the only information to be used is relative ordering
information.

	The problem with the first assumption is that the
appropriate fundamental operation is the machine operation.
All modern computers are inherently doing parallel processing
because they operate on all of the bits of a word.  The
second assumption is appropriate for the most general class
of sort where keys can be completely arbitrary; in practice,
however, additional information always exists and can be
exploited.

	A good example of an O(n) sort is the generalized
math sort.  Given n keys, assign n buckets.  Make a first
pass over the data to get the smallest and largest elements.
Make a second pass over the data and use linear interpolation
on the keys to get each bucket index.  After the second pass
use any sort method which is O(n) for almost-sorted data,
such as insertion sort.  In practice this sort is O(n); the
provisos are beyond the scope of this article.
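
A minimal sketch of that scheme for an array of doubles, assuming the
keys spread reasonably evenly between the extremes (the names are
invented for illustration):

        #include <stdlib.h>
        #include <string.h>

        void math_sort(double *a, int n)
        {
            double lo, hi, *tmp;
            int *count, i, b;

            if (n < 2) return;
            lo = hi = a[0];                /* pass 1: find min and max */
            for (i = 1; i < n; i++) {
                if (a[i] < lo) lo = a[i];
                if (a[i] > hi) hi = a[i];
            }
            if (lo == hi) return;          /* all keys equal: sorted */

            count = calloc(n, sizeof *count);
            tmp = malloc(n * sizeof *tmp);
            /* pass 2: interpolate each key to a bucket in 0..n-1,
               then place records bucket by bucket (a counting pass) */
            for (i = 0; i < n; i++)
                count[(int)((n - 1) * (a[i] - lo) / (hi - lo))]++;
            for (b = 1; b < n; b++)
                count[b] += count[b - 1];
            for (i = n - 1; i >= 0; i--)
                tmp[--count[(int)((n - 1) * (a[i] - lo) / (hi - lo))]] = a[i];
            memcpy(a, tmp, n * sizeof *a);
            free(tmp); free(count);

            for (i = 1; i < n; i++) {      /* insertion sort: O(n) on
                                              almost-sorted data */
                double v = a[i];
                for (b = i; b > 0 && a[b - 1] > v; b--)
                    a[b] = a[b - 1];
                a[b] = v;
            }
        }

The two bucket passes are linear; the cleanup pass is linear exactly
when interpolation leaves each record near its final cell, which is
where the provisos about the key distribution come in.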

	Address space is important.  It can be shown (see
Knuth) that if auxiliary storage devices are needed then
sorting must be O(n log n), and (I believe) that there
is nothing better than a merge sort.

	Finally, radix sorting is O(n log n), at best.  The
reason is that the keys must be at least log n bits long to
represent n distinct keys.

			Richard Harter, SMDS Inc.
			(Running at g-rh@cca-unix)