[comp.sys.atari.st] Disk based sorts

uace0@uhnix2.uh.edu (Michael B. Vederman) (11/20/88)

I am trying to do a disk based sort, but I get really hung down by the
extensive disk access.  I have implemented both Quick SOrt and Shell Sort,
but both seem inefficient.  I cache as much as I can in memory, but that
does little good for large data.

Does anyone have source code, or a reference for doing disk based sorts?

- mike

-- 
for (;;)                              : Use ATARINET, send an interactive
        do_it(c_programmers);         : message such as:
                                      : Tell UH-INFO at UHUPVM1 ATARINET HELP
University Atari Computer Enthusiasts : University of Houston UACE

leo@philmds.UUCP (Leo de Wit) (11/22/88)

In article <701@uhnix2.uh.edu> uace0@uhnix2.uh.edu (Michael B. Vederman) writes:
|I am trying to do a disk based sort, but I get really hung down by the
|extensive disk access.  I have implemented both Quick SOrt and Shell Sort,
|but both seem inefficient.  I cache as much as I can in memory, but that
|does little good for large data.
|
|Does anyone have source code, or a reference for doing disk based sorts?

Suggestion: a combination of quicksort (or shell sort) and merge sort:
1) Divide the data to be sorted into chunks that can be sorted in memory.
2) Now sort each chunk, using a fast sort algorithm such as the ones you
already implemented. Both are of order n*log(n), which is as fast as you
can get, assuming the data doesn't have a special property which you
can take advantage of.
3) Write the output to a temporary file, one for each chunk sorted.
Suggested name: sortxxxx.0, where xxxx is a 4 digit number indicating
the chunk.  The 0 indicates the merge level: no merging done yet.
4) Now you are left with let's say N sorted files. For each pair of
files do a merge sort (I will explain later). Each pair of input files
results in one (temporary) output file. The input files are deleted.
Example: files sort0048.5 and sort0049.5 have as a result file sort0024.6
5) Repeat 4) until there's only one file left. This is your sorted data.

Now for the merge sort:
Have two pointers (or indexes) move across the data of the two files
(you can alternatively use fgets, fread or read depending on what kind
of data you have or how you want to process it - low or high level
I/O).  The record that comes first in the ordering used, is written to
the output file (preferably using buffered I/O), and the corresponding
pointer (index) incremented (or alternatively the next record read).
This process continues until both input files have been read.
If you want to be really fast, you should consider avoiding the file
system and simply read complete tracks (perhaps several). In this case
you have to do some programming in order to keep track of how far you
read the input tracks (and how far you wrote the output tracks), or
perform still other tricks when records go over buffer boundaries, or
even span buffers.

Finally, as an example, a (very simple) implementation of a merge
routine (all buffered, using stdio), assuming each record is a physical
text line, so we can use fgets to read the next record of a file. The line
size is assumed not to exceed BUFSIZ (to keep the example simple). The
firstone() function is supposed to return the index of which of its
(char *) arguments is first in order (of course this depends on the
ordering applied). In this case I assumed lexicographic ordering so
strcmp can be used, and firstone implemented as a macro.
It is possible to write the while loop as three separate ones:

    a loop using: while (!feof(infp[0]) && !feof(infp[1]))
    a loop using: while (!feof(infp[0])
    a loop using: while (!feof(infp[1])

which can reduce testing within the loop somewhat (note however that
many implementations have feof() as a macro, so there's hardly any
performance penalty); I liked the compact form of the example.

-------- s t a r t   o f   e x a m p l e --------
#include <stdio.h>

#define firstone(s1,s2) ((strcmp(s1,s2) <= 0) ? 0 : 1)

void merge(in0,in1,outfp)
FILE *in0, *in1, *outfp;
{
    char inbuf[2][BUFSIZ];
    FILE *infp[2];                             /* needed for symmetry only */
    int which;

    infp[0] = in0; infp[1] = in1;

    fgets(inbuf[0],sizeof(inbuf[0]),infp[0]);
    fgets(inbuf[1],sizeof(inbuf[1]),infp[1]);

    while (!feof(infp[0]) || !feof(infp[1])) { /* not all read yet        */
        which = (feof(infp[0]))                /* infp[0] read completely */
                ? 1                            /* so take buffer 1        */
                : (feof(infp[1]))              /* infp[1] read completely */
                ? 0                            /* so take buffer 0        */
                : firstone(inbuf[0],inbuf[1]); /* take index of first one */
        fputs(inbuf[which],outfp);
        fgets(inbuf[which],sizeof(inbuf[which]),infp[which]);
    }
    /* closing the files is handled elsewhere */
}
-------- e n d   o f   e x a m p l e --------

As for literature on sorting, you should take a look at the books of D.
Knuth; they're about the best on this subject (I'm sorry I can't
remember any titles).

Hope this helps -
                   Leo.

tim@brspyr1.BRS.Com (Tim Northrup) (11/23/88)

From article <875@philmds.UUCP>, by leo@philmds.UUCP (Leo de Wit):
| In an article uace0@uhnix2.uh.edu (Michael B. Vederman) writes:
| | I am trying to do a disk based sort, ...
|
| As for literature on sorting, you should take a look at the books of D.
| Knuth; they're about the best on this subject (I'm sorry I can't
| remember any titles).

Most of this is described in detail in the book:

        "The Art of Computer Programming", Volume 3: Searching and Sorting

        by Donald E. Knuth
        Published by Addison Wesley
--
Tim Northrup                        +------------------------------------------+
+---------------------------------+         GEnie:  T.Northrup               |
UUCP: uunet!steinmetz!brspyr1!tim |   Air Warrior:  "Duke"                   |
ARPA: tim@brspyr1.BRS.Com          +------------------------------------------+