[comp.lang.c] Fast, stable, free, small sorts?

sbw@naucse.UUCP (09/18/87)

Can anyone point me to published/public domain stable sorting
algorithms that are reasonably time and space efficient?

I have an NlogN stable sort that eats memory like cupcakes, and
a stable sort that requires only N+C memory that barely beats
a bubble sort.  Every try at something in between has resulted
in what I consider failure - typically exemplified by no real
gain in speed over my slow one.  I know that it can be done
in N(logN)^2 time, and that Macro SPITBOL uses a stable heapsort,
but I don't want to spend any more of my time on it.

C code would be a nice way to describe the algorithm.

Please Email me any suggestions.

	Thanks!
	Steve Wampler
	{...!arizona!naucse!sbw}

g-rh@cca.CCA.COM (Richard Harter) (09/18/87)

In article <460@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
>Can anyone point me to published/public domain stable sorting
>algorithms that are reasonably time and space efficient?
>
>I have an NlogN stable sort that eats memory like cupcakes, and
>a stable sort that requires only N+C memory that barely beats
>a bubble sort.  Every try at something in between has resulted
>in what I consider failure - typically exemplified by no real
>gain in speed over my slow one.  I know that it can be done
>in N(logN)^2 time, and that Macro SPITBOL uses a stable heapsort,
>but I don't want to spend any more of my time on it.
>
>C code would be a nice way to describe the algorithm.
>
>Please Email me any suggestions.

	Your requirements are vague, and it makes a real difference.
Are you talking about sorting fixed length or variable length
records?  Is the sort to be in place, or is it to produce an output
array (file) with the input untouched?  Should it handle arbitrary
lengths?

	If an address calculation sort is admissible the sort can be
done in O(N) time with O(N) additional memory.  If O(log N) additional memory
is acceptable and an in place sort for fixed records is desired it can
be done stably in O(N log N) time.  If record sizes are variable things
are tricky.

	The best single reference that I know of is Knuth's volume III
on sorting.  I am not entirely happy with it -- there are important
techniques that are only lightly touched -- but it is comprehensive
and it does give the algorithms.

	May I suggest that you repost your request with more detail:

(a)	Is the data in memory or in a file?  If the latter are we talking
	about sorts of arbitrary length data?  If so are we talking about
	disk sorts or tape sorts?

(b)	Are you using fixed length records, or variable length records?

(c)	Are memory constraints (1) constant, (2) log N, or (3) N?

(d)	Is it a single-key sort or multi-key?

(e)	Is address calculation admissible or not admissible?  [In address
	calculation sorts you derive a function on the key that gives
	the approximate address of a record in the sorted file.]

(f)	Is it fixed key-length or variable key length?
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

wcs@ho95e.ATT.COM (Bill.Stewart) (09/19/87)

In article <460@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
:Can anyone point me to published/public domain stable sorting
:algorithms that are reasonably time and space efficient?
:I have [big-memory fast and small-memory slow algorithms]....
	If memory serves me correctly, isn't the Shell sort stable?
	Check out the volume of Knuth that deals with sorting and
	searching.  The analysis of run-time is quite interesting,
	but basically it's around nlogn or maybe n**1.5, and it's
	about as small and simple as bubble-sort.

:C code would be a nice way to describe the algorithm.
	Well, you won't find C code in Knuth; there's "English" and
	probably MIX (not even ALGOL, which was the politically correct
	language to describe algorithms in at the time.)  But it's not
	that hard to translate into C.
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wc(

lmiller@venera.isi.edu (Larry Miller) (09/19/87)

In article <460@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
>Can anyone point me to published/public domain stable sorting
>algorithms that are reasonably time and space efficient?

Why is STABLE necessary?  Most any NlogN routine can be written to
swap pointers rather than the data itself, so a stable routine shouldn't
be that important.  Are you looking for absolutely fastest routine?  But
that is very much data dependent.  Quick sort is not stable, but Shell sort
is, and is almost as fast for reasonable data sizes, so you might want to 
look at that.

Larry Miller
(lmiller@venera.isi.edu  --  no uucp)

sbw@naucse.UUCP (Steve Wampler) (09/20/87)

It was requested that I provide more information on my needs, so:

I want a stable, minimum storage sort that takes exactly the
same arguments as qsort (section 3 of the UNIX man pages).
It should be reasonably fast (i.e. not N^2).

Now, I have some requests of my own.  I'm sorry if this offends
anyone, but it's rather bothersome to have to wade through the
pages of misinformation I've been getting.

Please don't respond unless you:

	(1) Know what 'stable' means in terms of sorts.

	(2) Know what 'minimum storage' means.

	(3) Know of a fast algorithm that meets those
	    criteria.

None of the following basic sorting techniques meets (1) and/or (2)
above:

	Shell, Quick, Heap.

Any sort can be made stable, the question is at what cost.

Please don't quote timings from Knuth for me.  For that matter,
please don't refer me to Knuth.  While it is the best reference
around for sorting techniques, the discussion of stability is
minimal, being confined to exercises throughout the text.  He
does give a hint on writing an N(logN)^2 minimal storage stable
sort, but that's all, a hint.  I am well aware of what Knuth
and the majority of data structure texts have to say about stable
sorts.

The only published (and reasonably fast) stable sorting algorithm
I know about runs in N^(3/2) time (excuse me, stable *and* requiring
minimal storage).  I am hoping for something faster.

Remember, in my original posting, I stated that I have an NlogN
stable sort that isn't minimal storage (it takes N+N storage, as
do most of the obvious such beasties).  It's rather trivial to
modify a quick sort to do that (it's even more trivial to modify
a merge sort to be stable - but not to make it minimal in storage).

Please don't offer 'solutions' unless you know them to be just that.

Thanks.

g-rh@cca.CCA.COM (Richard Harter) (09/20/87)

In article <3603@venera.isi.edu> lmiller@venera.isi.edu.UUCP (Larry Miller) writes:
>
>Why is STABLE necessary?  Most any NlogN routine can be written to
>swap pointers rather than the data itself, so a stable routine shouldn't
>be that important...

	STABLE:  A sort is stable if all records with the same key are
	in the same order in the output file as they were in the input
	file.

Stability is essential for sorting on multiple keys.  There are many other
applications where stability is desirable.  Using swap pointers does not
resolve the problem [I will leave it as an exercise for the reader to
determine why not.]

Incidentally, quicksort can be converted into a stable sort at a price
in time without marker bits.  The trick is to make a first pass to
calculate the size of the partitions and load each partition from the
front.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

g-rh@cca.CCA.COM (Richard Harter) (09/22/87)

In article <461@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
>
>It was requested that I provide more information on my needs, so:
>
>I want a stable, minimum storage sort that takes exactly the
>same arguments as qsort (section 3 of the UNIX man pages).
>It should be reasonably fast (i.e. not N^2).

I wish I could help you more.  I don't have any references at
hand that meet your needs.  I could write such a sort, but I
won't do it for free, which doesn't help you.  The following
may be of use:

If I'm not missing anything, these are the requirements that you want met:

(1)	Stable
(2)	At least O(N log^2(N)) in time and preferably better
(3)	Fixed internal storage usage (O(1) additional storage)
(4)	Fixed record size specified in input.
(5)	Comparison function passed through calling sequence.
(6)	Data in memory, referenced by calling sequence
(7)	Sort in place.

I can think of two ways to meet them, offhand.  One is a modified
quicksort, sketched out below.  The other is a modified mergesort.
There is a trick for doing a mergesort in place.  I haven't thought
it through thoroughly so I will defer that one.

Quicksort can be converted into a stable, O(N log N) time, O(1) space
sort.  The price is added complexity and a larger time factor.
The essential elements are (a) altering the partition algorithm
to be stable, and (b) forcing the partition element to be the
median.

	Stabilizing the quicksort partition

The quicksort partition algorithm is not stable.  To make it stable
we make a first pass over the data and count the number of elements
less than, equal to, and greater than the test value.  We now know
the boundaries of the three segments.  We make a second pass and
count the number in each category within each segment.  This gives
us a total of 9 sub-segments.  Each sub-segment is filled from the
front, ping-pong style.  I.e. there is a separate index into each
sub-segment; when we are placing an element we put into the sub-segment
corresponding to its segment origin and destination and pop the element
contained therein.  This is a stable, O(N) partition.

1)  The fact that it takes three passes rather than one does not alter
the fact that it is O(N).

2)  This is the general partition algorithm for arbitrary test value.

3)  Coding this is not trivial.

	Forcing the median as partition element

Quicksort, in recursive form, is O(1) in space for each invocation;
however the number of invocations is log N, whereby the frame space
ends up being O(log N).  When we convert quicksort to nonrecursive
form we end up with a stack of begin-end indices of length log(N).

We get rid of this stack by forcing the partition element to be the
median of the data set.  We no longer need the begin-end stack because
all begin-end indices may be calculated explicitly.  [This is not
entirely trivial -- there is some nasty fiddling with odds and evens
to do, but it can be done.]

We may use the standard 'find-median' algorithm (with the stabilized
partition algorithm).  This is still O(N) but has a bad worst case.
We can improve the worst case by a variant of the median of medians
trick (which can't be used directly because of the storage constraint.)

Summary:  Quicksort can be converted into a stable partition sort which
is time O(N log N) on average and additional space O(1).
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

franka@mmintl.UUCP (Frank Adams) (09/22/87)

In article <1719@ho95e.ATT.COM> wcs@ho95e.UUCP (46133-Bill.Stewart,2G218,x0705,) writes:
>If memory serves me correctly, isn't the Shell sort stable?

No, it isn't.  The best O(n log n) stable sort I know of requires O(n)
additional memory.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

RMRichardson.PA@Xerox.COM (Rich) (09/22/87)

From: "Bill.Stewart" <wcs@ho95e.ATT.COM>
> If memory serves me correctly, isn't the Shell sort stable?
> Check out the volume of Knuth that deals with sorting and
> searching.  The analysis of run-time is quite interesting,
> but basically it's around nlogn or maybe n**1.5, and it's
> about as small and simple as bubble-sort.

No, the Shell sort is not *necessarily* stable.  The Shell sort is, 
in fact, a series of "sort by insertion" passes over the data, each 
sort being stable.  However, the different passes interleave the 
strings of data such that you cannot guarantee the final result will 
be stable.  

Note: the last pass of a Shell sort is a full sort by insertion 
looking at each item in the list.  It is more efficient than usual
because the previous passes have put the total list in "almost" sorted 
order.  

Rich

wfp@dasys1.UUCP (William Phillips) (09/22/87)

In article <1719@ho95e.ATT.COM>, wcs@ho95e.ATT.COM (Bill.Stewart) writes:
> In article <460@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
> :Can anyone point me to published/public domain stable sorting
> :algorithms that are reasonably time and space efficient?
> :I have [big-memory fast and small-memory slow algorithms]....
> 	If memory serves me correctly, isn't the Shell sort stable?
> 	....
> 
> :C code would be a nice way to describe the algorithm.
> 	Well, you won't find C code in Knuth; there's "English" and
> 	probably MIX (not even ALGOL, which was the politically correct
> 	language to describe algorithms in at the time.)  But it's not
> 	that hard to translate into C.


It's even easier to translate Ratfor (or Pascal) into C,
and there is a very workable Shell sort in _Software Tools_
(or _Software Tools in Pascal_) by Kernighan and Plauger.

I translated it into assembler for a mini back in
1977, and it was still in use on that system in 1987
(a very durable sort, I think)!  The Shell sort is easy
to understand (therefore to debug), reasonably
fast (not the fastest, but respectable), and about as 
compact as you could want.


-- 
William Phillips                 {allegra,philabs,cmcl2}!phri\
Big Electric Cat Public Unix           {bellcore,cmcl2}!cucard!dasys1!wfp
New York, NY, USA                (-: Just say "NO" to OS/2! :-)

rwhite@nusdhub.UUCP (Robert C. White Jr.) (09/23/87)

In article <20029@cca.CCA.COM>, g-rh@cca.CCA.COM (Richard Harter) writes:
> 	STABLE:  A sort is stable if all records with the same key are
> 	in the same order in the output file as they were in the input
> 	file.


	Use something we called duplicity:

	make an array of pointers to structures....
	each structure contains a pointer to a copy of
		the key value and a pointer to the same
		type of structure [recursive].
	read each value into a structure with any duplicates
		being pushed into a linked list built on the
		last element of the structure [the pointer]
	pointer-sort the array by your favorite method.
	write each array element in order to the output with
		any linked list entries printed in linked order.

The value must have either references to record numbers [disk-to-disk]
or array indices which represent the full data record.  This sort-list-header
method is fast and easy to implement with any fast and dirty
sort algorithm.  The linked lists provide the stability.

Rob ("so who asked you anyway?") white.

lmiller@venera.isi.edu.UUCP (09/23/87)

In article <20029@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>In article <3603@venera.isi.edu> lmiller@venera.isi.edu.UUCP (Larry Miller) writes:
>>
>>Why is STABLE necessary?  Most any NlogN routine can be written to
>>swap pointers rather than the data itself, so a stable routine shouldn't
>>be that important...
>
>Stability is essential for sorting on multiple keys.
              ^^^^^^^^^

Oh come on.  Any first year data structures/algorithms student could devise
a routine that will sort on multiple keys, and the sort does not have to be
stable.

Larry

sbw@naucse.UUCP (Steve Wampler) (09/25/87)

In article <3636@venera.isi.edu>, lmiller@venera.isi.edu (Larry Miller) writes:
> In article <20029@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
> >
> >Stability is essential for sorting on multiple keys.
>               ^^^^^^^^^
> Oh come on.  Any first year data structures/algorithms student could devise
> a routine that will sort on multiple keys, and the sort does not have to be
> stable.

True.  However, without stability the sort must know it is sorting multiple
fields (that's why sort(1) on Unix has those options).  If it's to be
more general purpose, then so far as I can tell, it must be stable.  I would
love to be shown wrong on this, as I seem to be incapable of writing a minimal
storage stable sort that is fast on large data sets.

btb@ncoast.UUCP (10/03/87)

i happen to think that stability is a very useful property to have
(at least as an option) in a sorting algorithm... coming from a vms 
environment, which wasn't pretty, but at least was humungous enough to
provide "LOTS OF COMMAND OPTIONS" particularly with respect to its
directory function and sort function, the SORT utility under MSDOS is
a big let down... it may be fast, but it doesn't know about multiple
sort keys and it IS NOT STABLE which is a pain when you are trying to 
maintain directories containing 200+ files... the speed isn't so critical
when using the sort utility to sort directory listings (dir | sort /+n...),
but the lack of stability makes it impossible to chain sorts together to
get a stream sorted the way you need it...
	you might say that this is a mistake that microsoft made, and i
might agree with you, but since i am primarily a scientific applications
programmer, writing a do it all sort is not within the scope of my 
priorities or available time (plus, since i must work in fortran, it is
not something i would want to do in that language anyway).
	a few months back, i participated in a similar discussion, and one
guy said that he was able to make a small change to his quicksort routine
to allow it to sort with stability... anyone have any comments?
somebody else commented that shellsort is not necessarily stable, but i
don't understand why not, since essentially if you don't swap on equality
things should stay in order, shouldn't they?

-- 
			Brad Banko
			Columbus, Ohio
			(formerly ...!decvax!cwruecmp!ncoast!btb)
			btb%ncoast@mandrill.cwru.edu

"The only thing we have to fear on this planet is man."
			-- Carl Jung, 1875-1961

franka@mmintl.UUCP (Frank Adams) (10/06/87)

In article <4791@ncoast.UUCP> btb@ncoast.UUCP (Brad Banko) writes:
>somebody else commented that shellsort is not necessarily stable, but i
>don't understand why not, since essentially if you don't swap on equality
>things should stay in order, shouldn't they?

The problem is not with equality of the compared elements, but equality of
one of them with an intermediate element.  Suppose we have a 3 element
array, with keys, in order, 2, 1, and 1.  We do a Shell sort with an initial
increment of 2.  The 2 is compared with the final 1, found larger, and
swapped.  Now the 1's are out of order.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

levy@ttrdc.UUCP (Daniel R. Levy) (10/06/87)

In article <463@naucse.UUCP>, sbw@naucse.UUCP (Steve Wampler) writes:
> If it's to be
> more general purpose, then so far as I can tell, it must be stable.  I would
> love to be shown wrong on this, as I seem to be incapable of writing a minimal
> storage stable sort that is fast on large data sets.

I'm a hacker, not a CS guru, so I might be full of hooey on this.  But, how
about using the standard UNIX qsort(3) routine with a compare() function which,
when all else is equal, ranks the items according to their original position
in the list?  Something like

struct datum {
	TYPE item;
	int position;
} ITEMS[NITEMS];

	...  /* fill the array with data */

int i;
for (i=0; i<NITEMS; i++)
	ITEMS[i].position = i;

int compare();

qsort((char *)ITEMS, NITEMS, sizeof(struct datum), compare);

...

compare(a,b)
char *a, *b;
{
	int i;
	i = DIFFERENCE(((struct datum *)a) -> item,
		((struct datum *)b) -> item);
	if (i)	/* <0 or >0 */
		return i;
	else
		return ((struct datum *)a) -> position -
			((struct datum *)b) -> position;	/* stable */
}

The catch of course is that you need to store NITEMS int values, as well
as slowing down the swap operations since the datum size is larger than
if it comprised only the data of interest.  But the order of time and
storage in terms of NITEMS does not change.
-- 
|------------Dan Levy------------|  Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
|         an Engihacker @        |		vax135}!ttrdc!ttrda!levy
| AT&T Computer Systems Division |  Disclaimer?  Huh?  What disclaimer???
|--------Skokie, Illinois--------|

sbw@naucse.UUCP (Steve Wampler) (10/07/87)

In article <4791@ncoast.UUCP>, btb@ncoast.UUCP (Brad Banko) writes:
> somebody else commented that shellsort is not necessarily stable, but i
> don't understand why not, since essentially if you don't swap on equality
> things should stay in order, shouldn't they?

The problem arises from the fact that in all but the last pass of the
Shell sort, you are comparing elements that aren't adjacent.  Exchanging
two non-adjacent elements runs the risk that you are changing the
relative positions of one of them with an equal-keyed element in between.

For example, consider the following comparison of e1 and e2:

	3 1 4 1 5 9
        ^     ^
       e1    e2

Exchanging them yields:

	1 1 4 3 5 9
        ^     ^
       e2    e1

but you've just changed the relative positions of the two elements
keyed with value '1'.