uace0@uhnix2.uh.edu (Michael B. Vederman) (11/20/88)
I am trying to do a disk-based sort, but I get really bogged down by the
extensive disk access. I have implemented both Quicksort and Shell sort,
but both seem inefficient. I cache as much as I can in memory, but that
does little good for large data sets.

Does anyone have source code, or a reference, for doing disk-based sorts?

- mike
--
for (;;)                              : Use ATARINET, send an interactive
    do_it(c_programmers);             : message such as:
                                      : Tell UH-INFO at UHUPVM1 ATARINET HELP
University Atari Computer Enthusiasts : University of Houston UACE
leo@philmds.UUCP (Leo de Wit) (11/22/88)
In article <701@uhnix2.uh.edu> uace0@uhnix2.uh.edu (Michael B. Vederman) writes:
|I am trying to do a disk based sort, but I get really hung down by the
|extensive disk access. I have implemented both Quick SOrt and Shell Sort,
|but both seem inefficient. I cache as much as I can in memory, but that
|does little good for large data.
|
|Does anyone have source code, or a reference for doing disk based sorts?

Suggestion: a combination of quicksort (or Shell sort) and merge sort:

1) Divide the data to be sorted into chunks that can be sorted in memory.

2) Sort each chunk, using a fast sort algorithm such as the ones you have
   already implemented. Both are of order n*log(n), which is as fast as
   you can get, assuming the data doesn't have a special property you can
   take advantage of.

3) Write each sorted chunk to its own temporary file. Suggested name:
   sortxxxx.0, where xxxx is a 4-digit number identifying the chunk. The
   0 indicates the merge level: no merging done yet.

4) You are now left with, say, N sorted files. For each pair of files, do
   a merge sort (which I will explain below). Each pair of input files
   results in one (temporary) output file; the input files are deleted.
   Example: files sort0048.5 and sort0049.5 merge into file sort0024.6.

5) Repeat 4) until only one file is left. This is your sorted data.

Now for the merge sort: have two pointers (or indexes) move across the
data of the two files (you can alternatively use fgets, fread or read,
depending on what kind of data you have and whether you want low- or
high-level I/O). The record that comes first in the ordering used is
written to the output file (preferably using buffered I/O), and the
corresponding pointer (index) is incremented (or, alternatively, the next
record is read). This process continues until both input files have been
read completely.

If you want to be really fast, you should consider avoiding the file
system and simply reading complete tracks (perhaps several).
In this case you have to do some programming to keep track of how far you
have read the input tracks (and how far you have written the output
tracks), and perform still other tricks when records cross buffer
boundaries, or even span several buffers.

Finally, as an example, a (very simple) implementation of a merge routine
(all buffered, using stdio), assuming each record is a physical text
line, so we can use fgets to read the next record of a file. The line
size is assumed not to exceed BUFSIZ (to keep the example simple). The
firstone() function is supposed to return the index of whichever of its
(char *) arguments comes first in the ordering applied. Here I assumed
lexicographic ordering, so strcmp can be used and firstone implemented as
a macro.

It is possible to write the while loop as three separate ones:

    a loop using: while (!feof(infp[0]) && !feof(infp[1]))
    a loop using: while (!feof(infp[0]))
    a loop using: while (!feof(infp[1]))

which can reduce the testing within the loop somewhat (note, however,
that many implementations have feof() as a macro, so there's hardly any
performance penalty); I liked the compact form of the example.

-------- s t a r t   o f   e x a m p l e --------

#include <stdio.h>
#include <string.h>     /* for strcmp */

#define firstone(s1,s2) ((strcmp(s1,s2) <= 0) ? 0 : 1)

void merge(in0,in1,outfp)
FILE *in0, *in1, *outfp;
{
    char inbuf[2][BUFSIZ];
    FILE *infp[2];      /* needed for symmetry only */
    int which;

    infp[0] = in0;
    infp[1] = in1;
    fgets(inbuf[0],sizeof(inbuf[0]),infp[0]);
    fgets(inbuf[1],sizeof(inbuf[1]),infp[1]);
    while (!feof(infp[0]) || !feof(infp[1])) {   /* not all read yet */
        which = (feof(infp[0]))                /* infp[0] read completely */
                ? 1                            /* so take buffer 1 */
                : (feof(infp[1]))              /* infp[1] read completely */
                ? 0                            /* so take buffer 0 */
                : firstone(inbuf[0],inbuf[1]); /* take index of first one */
        fputs(inbuf[which],outfp);
        fgets(inbuf[which],sizeof(inbuf[which]),infp[which]);
    }
    /* closing the files is handled elsewhere */
}

-------- e n d   o f   e x a m p l e --------

As for literature on sorting, you should take a look at the books of
D. Knuth; they're about the best on this subject (I'm sorry I can't
remember any titles).

Hope this helps -

    Leo.
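The pairwise merging of steps 4) and 5) can be sketched as a driver
around a merge routine like the one in the example. This is only a sketch
under the same one-line-per-record assumption; the merge_levels() helper
and the hard-coded filename buffers are illustrative, not part of the
original post (here EOF is tracked via the fgets return value instead of
feof, which avoids writing a stale buffer):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Two-way merge of sorted line-oriented files (same idea as the
   posted example, but tracking EOF via the fgets return value). */
static void merge(FILE *in0, FILE *in1, FILE *out)
{
    char buf[2][BUFSIZ];
    FILE *in[2];
    int eof[2], which;

    in[0] = in0; in[1] = in1;
    eof[0] = (fgets(buf[0], sizeof buf[0], in[0]) == NULL);
    eof[1] = (fgets(buf[1], sizeof buf[1], in[1]) == NULL);
    while (!eof[0] || !eof[1]) {
        which = eof[0] ? 1              /* file 0 exhausted: take 1 */
              : eof[1] ? 0              /* file 1 exhausted: take 0 */
              : (strcmp(buf[0], buf[1]) <= 0) ? 0 : 1;
        fputs(buf[which], out);
        eof[which] = (fgets(buf[which], sizeof buf[which], in[which]) == NULL);
    }
}

/* Merge the files sort0000.<level> .. sortNNNN.<level> pairwise, level
   by level, until one file remains; its name is written into 'final'
   (caller supplies a buffer of at least 32 chars). */
static void merge_levels(int nchunks, char *final)
{
    char a[32], b[32], out[32];
    int level = 0, n = nchunks;

    while (n > 1) {
        int i, next = 0;
        for (i = 0; i + 1 < n; i += 2) {
            FILE *fa, *fb, *fo;
            sprintf(a, "sort%04d.%d", i, level);
            sprintf(b, "sort%04d.%d", i + 1, level);
            sprintf(out, "sort%04d.%d", next++, level + 1);
            fa = fopen(a, "r"); fb = fopen(b, "r"); fo = fopen(out, "w");
            merge(fa, fb, fo);
            fclose(fa); fclose(fb); fclose(fo);
            remove(a); remove(b);       /* inputs are no longer needed */
        }
        if (i < n) {                    /* odd file out: promote it */
            sprintf(a, "sort%04d.%d", i, level);
            sprintf(out, "sort%04d.%d", next++, level + 1);
            rename(a, out);
        }
        n = next;
        level++;
    }
    sprintf(final, "sort%04d.%d", 0, level);
}
```

Each pass halves the number of files, so N chunks take about log2(N)
passes, and the pair (2k, 2k+1) at one level becomes file k at the next,
matching the sort0048.5 + sort0049.5 -> sort0024.6 naming in step 4).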
tim@brspyr1.BRS.Com (Tim Northrup) (11/23/88)
From article <875@philmds.UUCP>, by leo@philmds.UUCP (Leo de Wit):
| In an article uace0@uhnix2.uh.edu (Michael B. Vederman) writes:
| | I am trying to do a disk based sort, ...
|
| As for literature on sorting, you should take a look at the books of D.
| Knuth; they're about the best on this subject (I'm sorry I can't
| remember any titles).

Most of this is described in detail in the book:

    "The Art of Computer Programming", Volume 3: Sorting and Searching
    by Donald E. Knuth
    Published by Addison-Wesley

--
Tim Northrup
GEnie: T.Northrup           UUCP: uunet!steinmetz!brspyr1!tim
Air Warrior: "Duke"         ARPA: tim@brspyr1.BRS.Com