haahr@siemens.UUCP (Paul Haahr) (06/07/85)
Given a list sorted by keys k1, k2, is there a fast (i.e., better than
n log n) method for sorting it by keys k2, k1?  A stable sort on k2
(one that preserves order when keys are equal) or a merge where each
run is the sequence of values with equal k1 are both n log n.  Is there
anything I'm missing that will take advantage of the fact that the
k1's are already ordered?
--
Paul Haahr
..!{allegra,ihnp4}!princeton!{jendeh,siemens}!haahr
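[For concreteness, a minimal sketch in C of the stable-sort approach the
poster mentions; the record layout and names are illustrative, not from
the posting.  A stable sort on k2 leaves equal-k2 records in their
incoming order, which is k1 order, so the result is sorted by (k2, k1).
Insertion sort is used only because it is short and stable; it is
O(n**2) and illustrates the semantics, not the n log n bound.]

    /* records arrive already sorted by (k1, k2) */
    struct rec { int k1, k2; };

    void stable_sort_by_k2(struct rec *a, int n)
    {
        for (int i = 1; i < n; i++) {
            struct rec tmp = a[i];
            int j = i - 1;
            /* strict > keeps the sort stable: equal keys never swap */
            while (j >= 0 && a[j].k2 > tmp.k2) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = tmp;   /* result is in (k2, k1) order */
        }
    }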
gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (06/09/85)
> Given a list sorted by keys k1, k2 is there a fast (i.e., better than
> n log n) method for sorting it by keys k2, k1?  A stable sort on k2

A simple proof that no such general method exists: consider the case
where no two k2's match.  Then the k1 ordering tells you nothing about
the k2 ordering, so you are back to sorting an arbitrary permutation,
for which the usual n log n comparison lower bound applies.  QED.

Actually, there are some sorting methods that are better than N log N;
use the best standard sorting algorithm you have.
jss@sftri.UUCP (J.S.Schwarz) (06/09/85)
> Given a list sorted by keys k1, k2 is there a fast (i.e., better than
> n log n) method for sorting it by keys k2, k1?
>
> -- Paul Haahr
> ..!{allegra,ihnp4}!princeton!{jendeh,siemens}!haahr

Not in general.  If all key values are distinct then knowing that the
list is sorted by k1 gives no information about the order when sorted
by k2.

Jerry Schwarz
ihnp4!attunix!jss
gjerawlins@watdaisy.UUCP (Gregory J.E. Rawlins) (06/11/85)
In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
>[.....]
>Actually, there are some sorting methods that are better than N log N;
>[.....]

Pardon me?  Loose lips sink ships; please define your model.
--
Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins
ech@spuxll.UUCP (Ned Horvath) (06/13/85)
Example of a faster-than-nlogn sort: the keys are precisely the
integers 1..n.  In other words, each key tells you directly which cell
of the sorted array it belongs in.  The following pseudo-C then does
the sort (with keys 1..n in a 0-indexed array, the record belonging in
cell i carries key i+1, hence the offsets):

    for (i = 0; i < n; i++)
        while (key[i] != i + 1)
            exchange(i, key[i] - 1);   /* swap records at the two cells */

Note that every exchange puts (at least) one element in its final
position, so clearly the number of exchanges is upper-bounded by n-1.
Similarly, the while test can succeed at most n-1 times and fail at
most n times (total), so the whole thing is linear.

This sounds like a cheat, but indeed many 'real world' sorting problems
are of this type.  Another classical example is creating a new phone
book: I have a (very large) already sorted file -- the old phone book
-- along with a fairly short list of additions and deletions.  As a
function of the total size of the result, sorting the changes and
merging is effectively linear.

=Ned=
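[To make the phone-book point concrete, a minimal C sketch of merging
an already sorted master list with a short sorted batch of additions;
the names are illustrative and deletions are omitted for brevity.  Each
record is examined exactly once, so the merge is linear in the size of
the result.]

    /* merge two sorted arrays into out; returns records written */
    int merge_sorted(const int *master, int n,
                     const int *adds, int m, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < n && j < m)      /* take the smaller head each step */
            out[k++] = (master[i] <= adds[j]) ? master[i++] : adds[j++];
        while (i < n)               /* copy whichever input remains */
            out[k++] = master[i++];
        while (j < m)
            out[k++] = adds[j++];
        return k;
    }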
throopw@rtp47.UUCP (Wayne Throop) (06/13/85)
> In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> >Actually, there are some sorting methods that are better than N log N;
> Pardon me?  Loose lips sink ships; please define your model.
> Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
> {allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins

[I take the above to mean that the second poster doubts that there
are sort methods faster than NlogN, so, responding to that point:]

There are indeed sorting methods that are "faster" than NlogN.
Perhaps the simplest is "radix sorting", an example of which is the
good old card collator (well, old anyway).  The method is order N*M,
where N is the number of records and M is the "width" of the key.

This method works best when keys are very narrow.  In this
circumstance, you can have a "bin" for each possible key value.  Then
the sort is order N... you simply look at the key of each record, put
the record in the bin associated with its key, and then output the
contents of each bin in turn.

Note that this method only outperforms a traditional quicksort or
heapsort when the key is "smaller than" logN in some sense, and hence
makes sense only for VERY large sets of records or VERY narrow keys.

I hope this saves all those ships in danger of submersion.  I'll keep
my lips tight, just in case.  :-)
--
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!rtp47!throopw
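[A minimal C sketch of the one-bin-per-key-value sort just described --
a counting sort.  The key range K and array names are assumed for
illustration; with keys confined to 0..K-1 it runs in O(N + K).  Only
keys are moved here; with full records you would move the records in
the same distribution pass.]

    #define K 256   /* assumed key range, e.g. one-byte keys */

    void bin_sort(const int *key, int n, int *out)
    {
        int count[K] = {0};
        int start[K];
        int i, b, pos = 0;

        for (i = 0; i < n; i++)        /* tally each bin */
            count[key[i]]++;
        for (b = 0; b < K; b++) {      /* first output slot per bin */
            start[b] = pos;
            pos += count[b];
        }
        for (i = 0; i < n; i++)        /* drop each key in its bin */
            out[start[key[i]]++] = key[i];
    }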
sjs@u1100s.UUCP (Stan Switzer) (06/14/85)
In article <67@rtp47.UUCP> throopw@rtp47.UUCP (Wayne Throop) writes:
> There are indeed sorting methods that are "faster" than NlogN.  Perhaps
> the simplest is "radix sorting" [...] The method is order N*M, where N
> is the number of records and M is the "width" of the key.  This method
> works best when keys are very narrow.  In this circumstance, you can
> have a "bin" for each possible key value.  Then the sort is order N...

BUT, unless there is heavy duplication of keys, the width of the key
is at least log (base radix) of the number of records.  So the radix
sort is at least N log N under normal circumstances.  (With a million
distinct keys and radix 10, for instance, keys must be at least six
digits wide.)

> Note that this method only outperforms a traditional quicksort or
> heapsort when the key is "smaller than" logN in some sense, and hence
> makes sense only for VERY large sets of records or VERY narrow keys.

As I said above, this requires heavy duplication of keys.

> I hope this saves all those ships in danger of submersion.  I'll keep my
> lips tight, just in case.  :-)

Well you might!

P.S. Special cases can almost always be exploited: partial sortedness,
updates to sorted files, sparse, dense, or uniform distributions, etc.
------------------------------------------------------------------------------
Stan Switzer | "You know, they're growing mechanical trees / They grow to
ihnp4!u1100s!sjs | their full height and chop themselves down" - L. Anderson
gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (06/18/85)
> In article <11259@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> >[.....]
> >Actually, there are some sorting methods that are better than N log N;
> >[.....]
>
> Pardon me?  Loose lips sink ships; please define your model.

As just an example -- not the best possible algorithm -- consider
sorting by hashing (O(N) in time and space).  Various obvious
modifications could make this approach workable.  For practical
purposes, a good O(N log N) merge sort is just fine.
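[The posting doesn't spell the method out; one plausible reading is a
bucket sort driven by an order-preserving "hash".  The sketch below in
C is mine -- names, linked-list buckets, and the uniform-distribution
assumption included -- and is O(N) only on average, for roughly uniform
keys.]

    #include <stdlib.h>

    void bucket_sort(int *a, int n)
    {
        int min = a[0], max = a[0], i, b, k = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < min) min = a[i];
            if (a[i] > max) max = a[i];
        }
        if (min == max) return;            /* all keys equal */

        /* head[b] / next[i] form one linked list per bucket */
        int *head = malloc(n * sizeof *head);
        int *next = malloc(n * sizeof *next);
        int *out  = malloc(n * sizeof *out);
        for (b = 0; b < n; b++) head[b] = -1;
        for (i = 0; i < n; i++) {          /* monotone "hash" to bucket */
            b = (int)(((long)(a[i] - min) * (n - 1)) / (max - min));
            next[i] = head[b]; head[b] = i;
        }
        for (b = 0; b < n; b++)            /* gather buckets in order */
            for (i = head[b]; i != -1; i = next[i]) {
                int v = a[i], j = k++;     /* insert v into sorted out */
                while (j > 0 && out[j-1] > v) { out[j] = out[j-1]; j--; }
                out[j] = v;
            }
        for (i = 0; i < n; i++) a[i] = out[i];
        free(head); free(next); free(out);
    }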
gjerawlins@watdaisy.UUCP (Gregory J.E. Rawlins) (06/20/85)
>>>Actually, there are some sorting methods that are better than N log N;
>>
>> Pardon me?  Loose lips sink ships; please define your model.
>
>As just an example -- not the best possible algorithm -- consider
>sorting by hashing (O(N) in time and space).  Various obvious
>modifications could make this approach workable.  For practical
>purposes, a good O(N log N) merge sort is just fine.

I try not to clutter the net with things like this, but I wish to
point out that the time bound of any algorithm _depends on your
model_.  That was the reason I asked for a definition of the model.  I
didn't want people to go away with the notion that nlgn is beatable in
the general case, where you know nothing about your data and you are
counting the minimum number of comparisons necessary and sufficient to
transform any permutation of n data items into some (fixed)
permutation of itself.

There are only three ways to "beat" nlgn: either you know something
about the distribution of the input prior to running your algorithm,
or you decide to count something other than the number of comparisons
of two data elements, or you count something other than the worst
possible case.  In all three cases you have changed the model.  Radix
sort is a simple example of the first type of special case, since it
won't work unless you know that the input consists of integers in some
prespecified range.  Sorting in parallel using n processors (taking
constant time) is an example of the second.  Hash sort is an example
of the third type, since you are concerned with the average case.

Greg.
--
Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins
throopw@rtp47.UUCP (Wayne Throop) (06/24/85)
I agree with most of what Gregory Rawlins said in the referenced
posting, but there are two points I'd like to clarify.

>>>>Actually, there are some sorting methods that are better than N log N;
>>> Pardon me?  Loose lips sink ships; please define your model.
>>As just an example -- not the best possible algorithm -- [...etc...]
> I try not to clutter the net with things like this, but I wish to
>point out that the time bound of any algorithm _depends on your model_.

Hear hear.

>That was the reason I asked for a definition of the model.  I
>didn't want people to go away with the notion that nlgn is beatable in
>the general case, where you know nothing about your data and you are
>counting the minimum number of comparisons necessary and sufficient to
>transform any permutation of n data items into some (fixed)
>permutation of itself.

Here is point one where I disagree and want to put in my two cents
worth.  Comparison sorting is not the "general case" of sorting.  The
general case of sorting is generating an "ordered" permutation of n
data items.  Comparison sorting "knows" something about the data
items, in that it knows there is some test on any two of them that
will answer the question "is x less-than y".  I claim that this is
"knowing something" about the data.  Thus, I think the original poster
was quite reasonable when he said "there are some sorting methods that
are better than NlogN".  He *didn't* say "there are some comparison
sorting methods better than NlogN", which is indeed false.

> There are only three ways to "beat" nlgn: either you know something
>about the distribution of the input prior to running your algorithm,
>or you decide to count something other than the number of comparisons
>of two data elements, or you count something other than the worst
>possible case.  In all three cases you have changed the model.

I agree 100%, keeping in mind that what this boils down to is "to beat
NlogN you have to use something other than a comparison sort".

> Radix sort is a simple example of the first type of special case,
>since it won't work unless you know that the input consists of
>integers in some prespecified range.

The bit about radix sorting is where I wish to pick nit number two.
Radix sorting will work in just about any case where comparison
sorting will work.  (In fact, I suspect that it will work in *any*
case where comparison sorting will work, but I can't prove it.)  All
you need is a transform on the key that maps it onto a bit-string,
preserving the ordering relation a comparison sort would use, so that
the ordering relation becomes a "lexical" compare of the mapped
bit-strings.

For (a trivial) example, say that what you've got are keys of 20 ascii
characters that you wish to compare lexically.  Then the order of the
simple radix sort I'll describe is 20N (where N is the number of
records).  (Note that this means that radix sorting will outperform
comparison sorting in this case when you have more than 10**6 records
or so.)

The radix sort of these fixed-length keys is simple.  You radix sort
on the rightmost character.  Then on the next-to-rightmost character.
And so on.  Each stage of radix sorting is stable, so when you have
gotten to the point of sorting on the leftmost byte, you have a
complete sort.

As mentioned above, this technique is applicable given only that you
can map your keys into a lexically-comparable-bit-string form.
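[A minimal C sketch of the fixed-length-key radix sort just described;
the record layout and names are assumed for illustration.  One stable
counting pass per character position, rightmost first: 20 passes over N
records, i.e., 20N work.]

    #include <string.h>

    #define KEYLEN 20
    struct rec { char key[KEYLEN]; /* ... payload ... */ };

    void radix_sort(struct rec *a, struct rec *tmp, int n)
    {
        for (int pos = KEYLEN - 1; pos >= 0; pos--) {
            int count[256] = {0}, start[256], s = 0;
            for (int i = 0; i < n; i++)       /* tally this column */
                count[(unsigned char)a[i].key[pos]]++;
            for (int c = 0; c < 256; c++) {   /* first slot per bin */
                start[c] = s;
                s += count[c];
            }
            for (int i = 0; i < n; i++)       /* stable distribution */
                tmp[start[(unsigned char)a[i].key[pos]]++] = a[i];
            memcpy(a, tmp, n * sizeof *a);
        }
    }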
Such a mapping turns out to exist for many interesting cases, such as
floating point numbers, bcd (with leading sign) numbers, and so on
(strings of ascii characters are already so mapped).

The analogy between radix sorting and comparison sorting is strong.
The general comparison sort consists of an algorithm and a comparison
function that will order the keys.  The general radix sort consists of
an algorithm and a mapping function that maps the keys into a
bit-string as described above.

>Sorting in parallel using n processors (taking constant time)
>is an example of the second.  Hash sort is an example of the
>third type, since you are concerned with the average case.

Isn't the worst case of comparison sorting N-squared?  I thought that
the NlogN was only the "average" or "expected" case.

>Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo
>{allegra|clyde|linus|inhp4|decvax}!watmath!watdaisy!gjerawlins
--
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!rtp47!throopw
g-rh@cca.UUCP (Richard Harter) (06/25/85)
Gregory J.E. Rawlins, Department of Computer Science, U. Waterloo, writes:
> There are only three ways to "beat" nlgn: either you know something
>about the distribution of the input prior to running your algorithm,
>or you decide to count something other than the number of comparisons
>of two data elements, or you count something other than the worst
>possible case.  In all three cases you have changed the model.  Radix
>sort is a simple example of the first type of special case, since it
>won't work unless you know that the input consists of integers in some
>prespecified range.  Sorting in parallel using n processors (taking
>constant time) is an example of the second.  Hash sort is an example
>of the third type, since you are concerned with the average case.

Er, well, it is a bit more complicated than that.  Your third
formulation is wrong; the basic theorem of sorting says that the
average can be no better than O(n log n).  However, the theorem is a
little misleading, since O(n) sorts exist for rather general classes
of cases.

The basis for the theorem is that a sort must distinguish between the
n! possible permutations.  It happens that n! is O(n**n).  The number
of binary comparisons required to distinguish the n! cases is
O(log2 n!), which is O(n log n).  This is the best possible average
case.

The reason that the basic theorem is misleading is that it is stated
in terms of binary comparisons.  This involves two assumptions.  The
first is that the binary comparison is the appropriate element of
measurement.  The second is that the only information to be used is
relative ordering information.  The problem with the first assumption
is that the appropriate fundamental operation is the machine
operation.  All modern computers are inherently doing parallel
processing because they operate on all of the bits of a word.  The
second assumption is appropriate for the most general class of sort,
where keys can be completely arbitrary; in practice, however,
additional information always exists and can be exploited.

A good example of an O(n) sort is the generalized math sort.  Given n
keys, assign n buckets.  Make a first pass over the data to get the
smallest and largest elements.  Make a second pass over the data and
use linear interpolation on the keys to get the bucket index.  After
the second pass, use any sort method which is O(n) for almost sorted
data.  In practice this sort is O(n); the provisos are beyond the
scope of this article.

Address space is important.  It can be shown (see Knuth) that if
auxiliary storage devices are needed, then sorting must be
O(n log n), and (I believe) that there is nothing better than a merge
sort.

Finally, radix sorting is O(n log n) at best.  The reason is that the
keys must be at least log n bits long to represent n distinct keys.

Richard Harter, SMDS Inc.
(Running at g-rh@cca-unix)
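[A minimal C sketch of the generalized math sort just described; the
names and the choice of insertion sort as the almost-sorted cleanup
pass are assumptions, not from the posting.  Pass 1 finds the range,
pass 2 interpolates each key to a bucket and distributes, and the
cleanup pass is O(n) when buckets hold O(1) keys on average.]

    #include <stdlib.h>

    void math_sort(double *a, int n)
    {
        double min = a[0], max = a[0];
        int i, b;
        for (i = 1; i < n; i++) {              /* pass 1: find range */
            if (a[i] < min) min = a[i];
            if (a[i] > max) max = a[i];
        }
        if (min == max) return;

        int *count = calloc(n + 1, sizeof *count);
        double *out = malloc(n * sizeof *out);
        for (i = 0; i < n; i++)                /* pass 2: interpolate */
            count[1 + (int)((a[i] - min) / (max - min) * (n - 1))]++;
        for (b = 1; b <= n; b++)               /* bucket start offsets */
            count[b] += count[b - 1];
        for (i = 0; i < n; i++)                /* distribute to buckets */
            out[count[(int)((a[i] - min) / (max - min) * (n - 1))]++] = a[i];

        for (i = 1; i < n; i++) {              /* cleanup: insertion sort,
                                                  O(n) on almost sorted data */
            double v = out[i];
            int j = i - 1;
            while (j >= 0 && out[j] > v) { out[j + 1] = out[j]; j--; }
            out[j + 1] = v;
        }
        for (i = 0; i < n; i++) a[i] = out[i];
        free(count); free(out);
    }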