jwp@larry.sal.wisc.edu (Jeffrey W Percival) (07/07/89)
I am writing a C program (Ultrix, VAXstation 2000) to re-arrange a
large disk file.  The file contains a large number of fixed length
records.  My current method is this: read through the file, building
two arrays in memory: array1 holds the attribute I want to sort on,
and array2 is just an array of ascending integers (record numbers in
the first file).  Then I sort array1, with array2 tracking the
movements in array1.  I end up with array2 being a mapping of record
numbers between the input file and the sorted file.  Finally, I loop
through array2, reading record array2[i] from file1 and writing it to
record i of file2.

Now, I'm not looking for help with the sorting of array1; my sorting
algorithm is fast and efficient.  What is taking a long time are the
quasi-random seeks in file1 as file2 is sequentially built.  I cannot
have all of file1 in memory, but I can have more than the one record
I am currently using in this shuffle.  How can I speed this up?
--
Jeff Percival (jwp@larry.sal.wisc.edu)
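For concreteness, the scheme described above can be sketched in C.  This
is a minimal sketch, not the poster's actual code: the fixed record
length RECSIZE, the helper name keysort_copy, and a long key stored at
offset 0 of each record are all illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECSIZE 128             /* assumed fixed record length */

struct keyrec {
    long key;                   /* sort attribute (array1) */
    long recno;                 /* original record number (array2) */
};

static int cmp(const void *a, const void *b)
{
    const struct keyrec *x = a;
    const struct keyrec *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Copy "in" to "out" in key order: one sequential pass to collect
 * the keys, then one quasi-random seek+read per output record. */
int keysort_copy(FILE *in, FILE *out, long nrec)
{
    struct keyrec *map = malloc(nrec * sizeof *map);
    char rec[RECSIZE];
    long i;

    if (map == NULL)
        return -1;
    for (i = 0; i < nrec; i++) {        /* sequential key pass */
        if (fread(rec, RECSIZE, 1, in) != 1) {
            free(map);
            return -1;
        }
        memcpy(&map[i].key, rec, sizeof map[i].key);
        map[i].recno = i;
    }
    qsort(map, (size_t)nrec, sizeof *map, cmp);

    for (i = 0; i < nrec; i++) {        /* the slow shuffle */
        fseek(in, map[i].recno * RECSIZE, SEEK_SET);
        if (fread(rec, RECSIZE, 1, in) != 1 ||
            fwrite(rec, RECSIZE, 1, out) != 1) {
            free(map);
            return -1;
        }
    }
    free(map);
    return 0;
}
```

The second loop is the bottleneck being asked about: the reads from
"in" hit the disk in key order, not in file order.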
dhesi@bsu-cs.bsu.edu (Rahul Dhesi) (07/08/89)
In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
Percival) writes:
[how do I sort a large file that won't fit in memory?]

There are many variations on the merge sort.  Here is a simple one:

   1.  break up the original file into N smaller files
   2.  sort each smaller file
   3.  merge them all
--
Rahul Dhesi <dhesi@bsu-cs.bsu.edu>
UUCP:    ...!{iuvax,pur-ee}!bsu-cs!dhesi
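Those three steps can be sketched in C (assuming, for brevity, a file
of plain longs rather than full records; make_run and merge2 are
hypothetical helper names, and a real external sort would merge many
runs at once rather than two):

```c
#include <stdio.h>
#include <stdlib.h>

static int cmpl(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Steps 1 and 2: read up to n longs from "in", sort them in memory,
 * and write the sorted run to a temporary file. */
FILE *make_run(FILE *in, long n)
{
    long *buf = malloc(n * sizeof *buf);
    FILE *run = tmpfile();
    size_t got = fread(buf, sizeof *buf, (size_t)n, in);

    qsort(buf, got, sizeof *buf, cmpl);
    fwrite(buf, sizeof *buf, got, run);
    rewind(run);
    free(buf);
    return run;
}

/* Step 3: 2-way merge of two sorted run files.  All I/O here is
 * sequential, which is the whole point of the method. */
void merge2(FILE *a, FILE *b, FILE *out)
{
    long x, y;
    int hasx = fread(&x, sizeof x, 1, a) == 1;
    int hasy = fread(&y, sizeof y, 1, b) == 1;

    while (hasx && hasy) {
        if (x <= y) {
            fwrite(&x, sizeof x, 1, out);
            hasx = fread(&x, sizeof x, 1, a) == 1;
        } else {
            fwrite(&y, sizeof y, 1, out);
            hasy = fread(&y, sizeof y, 1, b) == 1;
        }
    }
    while (hasx) {
        fwrite(&x, sizeof x, 1, out);
        hasx = fread(&x, sizeof x, 1, a) == 1;
    }
    while (hasy) {
        fwrite(&y, sizeof y, 1, out);
        hasy = fread(&y, sizeof y, 1, b) == 1;
    }
}
```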
ka@cbnewsh.ATT.COM (Kenneth Almquist) (07/08/89)
jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
> I am writing a C program (Ultrix, VAXstation 2000) to re-arrange a
> large disk file.  The file contains a large number of fixed length
> records.  [The program sorts the keys in memory, and then does
> random accesses to the file to write the records in order.]
>
> I cannot have all of [the file] in memory, but can have more than the one
> record I am currently using in this shuffle.  How can I speed this up?

I'm pretty sure that you can't speed up a shuffle operation much by
keeping only part of the data in memory.  Anyone with a theoretical
bent want to try to prove this?

I suggest you try the traditional approach of sorting chunks of the
file in memory and then merging the results.  This will perform more
I/O than your scheme, but it will be sequential I/O.
				Kenneth Almquist
jwp@larry.sal.wisc.edu (Jeffrey W Percival) (07/08/89)
In article <8137@bsu-cs.bsu.edu> dhesi@bsu-cs.bsu.edu (Rahul Dhesi) writes:
>In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
>Percival) writes:
>[how do I sort a large file that won't fit in memory?]
>There are many variations on the merge sort.  Here is a simple one:

Please be careful with your paraphrasing.  I did not ask how to sort a
large file that won't fit in memory.  I said I already had a fast and
efficient sort and that the sorting was already known.  My question was
about optimizing the process of rearranging a disk file according to a
*given* mapping.

One helpful person suggested reading sequentially and writing randomly,
rather than vice-versa, and I tried that but it didn't help.  I guess
the benefit gained from using the input stream buffering was cancelled
out by the effective loss of the output stream buffering.

The ever-attentive and always-reliable Mr Torek suggested looking
into disk cache algorithms in OS journals, and verily that I will do,
with thanks to those who responded.
--
Jeff Percival (jwp@larry.sal.wisc.edu)
jeffrey@algor2.UUCP (Jeffrey Kegler) (07/09/89)
In article <207@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu.UUCP (Jeffrey W Percival) writes:
=> Please be careful with your paraphrasing.
I certainly promise to try.
=> My question was about optimizing the process of rearranging a disk file
=> according to a *given* mapping.
Jeffrey P. (no relation) had implemented a suggestion made in the AT&T
Bell Laboratories Technical Journal, by J. P. Linderman, p. 258. He
extracted the keys and record locations of an unsorted file (call it
U), sorted them, and then constructed the sorted file (call it S),
only to find the random seeks involved in the last phase horrifyingly
slow.
=> One helpful person suggested reading sequentially and writing randomly,
=> rather than vice-versa,
That would have been my first thought.
=> and I tried that but it didn't help. I guess
=> the benefit gained from using the input stream buffering was cancelled
=> out by the effective loss of the output stream buffering.
Oh well.
As a second try, allocate enough memory for N full-length records,
plus two arrays, sorted together, of N keys and N record locations.
Go through the sorted keys and find the keys and locations in U of
the first N records of the *sorted* file.  Sort those N entries by
record location in U, the unsorted file, then read the records in
ascending order of location in U, placing each one at its sorted
position (by key) in the array of full-length records.  Then write
those records out.  Repeat until all records are written.  This will
involve 1 sequential pass to write file S, and M/N roughly sequential
passes to read file U.
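A sketch of this batch scheme in C (RECSIZE, the batch size N, the
helper name batched_rearrange, and the order[] representation of the
given mapping are all illustrative assumptions):

```c
#include <stdio.h>
#include <stdlib.h>

#define RECSIZE 128   /* assumed fixed record length */
#define N 64          /* records buffered per batch */

struct slot {
    long src;         /* record number in U */
    int pos;          /* destination index within this batch of S */
};

static int bysrc(const void *a, const void *b)
{
    const struct slot *x = a, *y = b;
    return (x->src > y->src) - (x->src < y->src);
}

/* order[] gives, for each record of the sorted file S, its record
 * number in the unsorted file U (the given mapping).  Each batch is
 * re-sorted by source location, so U is always read in ascending
 * order; S is written strictly sequentially. */
int batched_rearrange(FILE *u, FILE *s, const long *order, long nrec)
{
    char *buf = malloc(N * RECSIZE);
    struct slot slots[N];
    long done = 0;

    if (buf == NULL)
        return -1;
    while (done < nrec) {
        long batch = nrec - done < N ? nrec - done : N;
        long i;

        for (i = 0; i < batch; i++) {
            slots[i].src = order[done + i];
            slots[i].pos = (int)i;
        }
        qsort(slots, (size_t)batch, sizeof *slots, bysrc);
        for (i = 0; i < batch; i++) {   /* ascending seeks in U */
            fseek(u, slots[i].src * RECSIZE, SEEK_SET);
            if (fread(buf + slots[i].pos * RECSIZE, RECSIZE, 1, u) != 1) {
                free(buf);
                return -1;
            }
        }
        if (fwrite(buf, RECSIZE, (size_t)batch, s) != (size_t)batch) {
            free(buf);
            return -1;
        }
        done += batch;
    }
    free(buf);
    return 0;
}
```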
A further improvement is to calculate how many sequential reads cost
the same as a random seek. Call that ratio R. Whenever performing
the algorithm above would require more than R sequential reads (this
is easily determined from the difference in the record locations),
perform a seek.
My guess at R for UNIX is around 2.5 times the number of records per block.
Obviously, the larger N is, the better this will work.  Note that
your original try is this algorithm in the special case where N is 1.
If we could run this algorithm in terms of physical disk blocks
instead of logical file locations, it could really hum.
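The seek-or-read-through decision might look like this (a sketch only:
RECSIZE is an assumed fixed record length, and r is the ratio R above
expressed as a record count):

```c
#include <stdio.h>

#define RECSIZE 128   /* assumed fixed record length */

/* Advance past "gap" unwanted records to reach the next record we
 * want.  r is the number of sequential record reads that cost the
 * same as one random seek; below that threshold, reading through
 * the intervening records beats seeking. */
void advance(FILE *f, long gap, long r)
{
    char scratch[RECSIZE];

    if (gap <= r) {                     /* cheaper to read through */
        while (gap-- > 0)
            fread(scratch, RECSIZE, 1, f);
    } else {                            /* cheaper to seek */
        fseek(f, gap * RECSIZE, SEEK_CUR);
    }
}
```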
Further optimizations suggest themselves, but enough already.
--
Jeffrey Kegler, President, Algorists,
jeffrey@algor2.UU.NET or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090
kuno@gssm.otsuka.tsukuba.JUNET (Yasushi Kuno) (07/12/89)
jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:

>My question was about optimizing the process of rearranging a disk
>file according to a *given* mapping.

>One helpful person suggested reading sequentially and writing randomly,
>rather than vice-versa, and I tried that but it didn't help.  I guess
>the benefit gained from using the input stream buffering was cancelled
>out by the effective loss of the output stream buffering.

>The ever-attentive and always-reliable Mr Torek suggested looking
>into disk cache algorithms in OS journals, and verily that I will do,
>with thanks to those who responded.

Oh, that seems quite right, but how about this?  Open N files for
writing, read sequentially, and write each record to one of them so
that the N output files form ordered buckets... i.e., every record in
output file 1 is smaller than every record in output file 2, and so
on.  (You can do this because you already have the mapping in memory.)
Then the original problem reduces to sorting N smaller random files.
Apply this recursively until each file fits into your memory.  N and
the number of passes should depend on your system.  Isn't this simple?
I'm curious whether it works well or not, so let me know if you have
tried this.

							Yasushi Kuno
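A one-level sketch of this distribution pass (assumptions: fixed-length
RECSIZE records, a hypothetical helper name distribute, and a rank[]
array giving each input record's final position, i.e. the inverse of
the given mapping):

```c
#include <stdio.h>

#define RECSIZE  128   /* assumed fixed record length */
#define NBUCKETS 4     /* number of output bucket files */

/* rank[i] = final position of U's record i in the sorted file S.
 * One sequential read of U sends each record to the bucket covering
 * its range of final positions; bucket b then holds exactly the
 * records destined for S positions [b*per, (b+1)*per), in arbitrary
 * order, and is small enough to fix up in memory (or by recursing). */
int distribute(FILE *u, FILE *bucket[NBUCKETS], const long *rank, long nrec)
{
    char rec[RECSIZE];
    long per = (nrec + NBUCKETS - 1) / NBUCKETS;  /* positions per bucket */
    long i;

    for (i = 0; i < nrec; i++) {
        if (fread(rec, RECSIZE, 1, u) != 1)
            return -1;
        fwrite(rec, RECSIZE, 1, bucket[rank[i] / per]);
    }
    return 0;
}
```

Both the read of U and the writes to each bucket are sequential, which
is what makes the stream buffering effective again.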
buck@siswat.UUCP (A. Lester Buck) (07/13/89)
In article <207@larry.sal.wisc.edu>, jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
> In article <8137@bsu-cs.bsu.edu> dhesi@bsu-cs.bsu.edu (Rahul Dhesi) writes:
> >In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
> >Percival) writes:
> >[how do I sort a large file that won't fit in memory?]
> >There are many variations on the merge sort.  Here is a simple one:
>
> Please be careful with your paraphrasing.  I did not ask how to sort a
> large file that won't fit in memory.  I said I already had a fast and
> efficient sort and that the sorting was already known.  My question was
> about optimizing the process of rearranging a disk file according to a
> *given* mapping.
> --
> Jeff Percival (jwp@larry.sal.wisc.edu)

I was one who emailed about basic external sorting.  I don't understand
what you are saying above.  If your problem is to sort a large file
that is on external storage, then the classical methods are what you
want.  If you somehow have a problem where you are handed the already
sorted list of pointers to records and you want to go from that stage,
that is _maybe_ a different problem - I don't know, because that sounds
extremely contrived.

It is pointed out in many places, including Knuth Vol. 3, that it gains
you next to *nothing* to get a sorted table of pointers to records when
what you almost always want is the records themselves sorted.  "Fast
and efficient sorts" that don't order the records themselves are not of
much advantage.  You are just delaying the real pain of moving the
records.  From Knuth Vol. 3, p. 374:

    The external rearrangement problem which remains after [key
    sorting] seems trivial, at first glance; but in fact it is rather
    difficult, and no really good algorithms (significantly better
    than sorting) have yet been found.  We could obviously do the
    rearrangement in N steps, moving one record at a time; for large
    enough N this is better than the N(log N) of a sorting method.
    But N is never that large, and N seeks are unthinkable!  [...]
    But it seems wasteful to do a key sort only to follow it by a
    full sort!

[Knuth goes on to discuss a nontrivial lower bound on the number of
seeks required to rearrange records on a disk file.]

The only real gain of sorting the keys and not the records themselves
is when you _don't_ have to move the data, as in an indexed file
application.  You still have to physically sort the records once if
you require reasonable sequential access.  I remember reading once
that an estimated 40% of all mainframe cycles in the world were spent
on sorting.

I am sure we would all be interested in hearing about any special case
you might have where your keysort is completely free, but I doubt
whether you will be able to do much better than the "classical"
sort/merge with replacement selection.  If you do end up with a good
method, you have a highly publishable (and patentable!) piece of work.
--
A. Lester Buck    ...!texbell!moray!siswat!buck