[comp.unix.wizards] Algorithm needed: reading/writing a large file

jwp@larry.sal.wisc.edu (Jeffrey W Percival) (07/07/89)

I am writing a C program (Ultrix, VAXstation 2000) to re-arrange a
large disk file.  The file contains a large number of fixed length
records.  My current method is this: read through the file, building 2
arrays in memory:  array1 is an array of some attribute I want to sort
on, and array2 is just an array of ascending integers (record numbers in
the first file).  Then I sort array1, with array2 tracking the
movements in array1.  I end up with array2 being a mapping of record
numbers between the input file and the sorted file.  Finally, I loop
through array2 doing a read on file1, record array2[i] and writing to
file2, record i.
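
In outline, the method looks something like this (just a sketch, not my
actual code; RECLEN, the key layout, and the file names are stand-ins):

    /* Build (key, record number) pairs, sort them, then copy record
     * map[i] of file1 into slot i of file2.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECLEN 100              /* fixed record length -- a stand-in */

    struct entry { long key; long recno; };

    static int cmp(const void *a, const void *b)
    {
        long ka = ((const struct entry *)a)->key;
        long kb = ((const struct entry *)b)->key;
        return (ka > kb) - (ka < kb);
    }

    int main(void)
    {
        FILE *in = fopen("file1", "rb"), *out = fopen("file2", "wb");
        char rec[RECLEN];
        struct entry *map = NULL;
        long n = 0, i;

        if (in == NULL || out == NULL)
            return 1;

        /* pass 1: sequential read, noting each record's key and number */
        while (fread(rec, RECLEN, 1, in) == 1) {
            map = realloc(map, (n + 1) * sizeof *map);
            memcpy(&map[n].key, rec, sizeof map[n].key); /* key = leading bytes, assumed */
            map[n].recno = n;
            n++;
        }

        qsort(map, n, sizeof *map, cmp);     /* stands in for my own sort */

        /* pass 2: the slow part -- quasi-random reads, sequential writes */
        for (i = 0; i < n; i++) {
            fseek(in, map[i].recno * (long)RECLEN, SEEK_SET);
            fread(rec, RECLEN, 1, in);
            fwrite(rec, RECLEN, 1, out);
        }
        return 0;
    }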

Now, I'm not looking for help in the sorting of array1;  my sorting
algorithm is fast and efficient.  What is taking a long time are the
quasi-random seeks in file1 as file2 is sequentially built.

I cannot have all of file1 in memory, but can have more than the one
record I am currently using in this shuffle.  How can I speed this up?
-- 
Jeff Percival (jwp@larry.sal.wisc.edu)

dhesi@bsu-cs.bsu.edu (Rahul Dhesi) (07/08/89)

In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
Percival) writes:
[how do I sort a large file that won't fit in memory?]

There are many variations on the merge sort.  Here is a simple one:

     break up the original file into N smaller files
     sort each smaller file
     merge them all
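
In C a sketch might look like the following (record length, chunk size,
key layout, and the single-pass merge limit are all placeholders; a
serious version would use replacement selection and a smarter merge):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECLEN 100                /* fixed record length -- a stand-in */
    #define CHUNK  10000              /* records sorted in memory per run */
    #define MAXRUN 64                 /* enough runs for one merge pass */

    static char buf[CHUNK][RECLEN];   /* in-memory chunk */

    static int cmp(const void *a, const void *b)
    {
        long ka, kb;                  /* key assumed to be the leading long */
        memcpy(&ka, a, sizeof ka);
        memcpy(&kb, b, sizeof kb);
        return (ka > kb) - (ka < kb);
    }

    int main(void)
    {
        FILE *in = fopen("file1", "rb"), *out = fopen("file2", "wb");
        FILE *run[MAXRUN];
        char name[32], cur[MAXRUN][RECLEN];
        int nruns = 0, done[MAXRUN], i, best;
        size_t got;

        if (in == NULL || out == NULL)
            return 1;

        /* break up the original file: sort CHUNK records at a time and
         * write each sorted run out sequentially */
        while ((got = fread(buf, RECLEN, CHUNK, in)) > 0) {
            if (nruns == MAXRUN)
                return 1;             /* would need another merge pass */
            qsort(buf, got, RECLEN, cmp);
            sprintf(name, "run.%d", nruns);
            run[nruns] = fopen(name, "w+b");
            fwrite(buf, RECLEN, got, run[nruns]);
            rewind(run[nruns]);
            nruns++;
        }

        /* merge them all: repeatedly pick the smallest current record */
        for (i = 0; i < nruns; i++)
            done[i] = (fread(cur[i], RECLEN, 1, run[i]) != 1);
        for (;;) {
            best = -1;
            for (i = 0; i < nruns; i++)
                if (!done[i] && (best < 0 || cmp(cur[i], cur[best]) < 0))
                    best = i;
            if (best < 0)
                break;
            fwrite(cur[best], RECLEN, 1, out);
            done[best] = (fread(cur[best], RECLEN, 1, run[best]) != 1);
        }
        return 0;
    }
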
-- 
Rahul Dhesi <dhesi@bsu-cs.bsu.edu>
UUCP:    ...!{iuvax,pur-ee}!bsu-cs!dhesi

ka@cbnewsh.ATT.COM (Kenneth Almquist) (07/08/89)

jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
> I am writing a C program (Ultrix, VAXstation 2000) to re-arrange a
> large disk file.  The file contains a large number of fixed length
> records.  [The program sorts the keys in memory, and then does
> random accesses to the file to write the records in order.]
>
> I cannot have all of [the file] in memory, but can have more than the one
> record I am currently using in this shuffle.  How can I speed this up?

I'm pretty sure that you can't speed up a shuffle operation much by
keeping only part of the data in memory.  Anyone with a theoretical
bent want to try to prove this?

I suggest you try the traditional approach of sorting chunks of the
file in memory and then merging the results.  This will perform more
I/O than your scheme, but it will be sequential I/O.
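
To get a feel for why that trade wins, with made-up but plausible
numbers: fetching 300,000 records one at a time with a random seek
apiece, at roughly 30 ms per seek, is over two hours of positioning
alone, while a sequential pass over the same 30 megabytes at a few
hundred kilobytes per second takes a couple of minutes, so even several
sequential merge passes come out far ahead.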
				Kenneth Almquist

jwp@larry.sal.wisc.edu (Jeffrey W Percival) (07/08/89)

In article <8137@bsu-cs.bsu.edu> dhesi@bsu-cs.bsu.edu (Rahul Dhesi) writes:
>In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
>Percival) writes:
>[how do I sort a large file that won't fit in memory?]
>There are many variations on the merge sort.  Here is a simple one:

Please be careful with your paraphrasing.  I did not ask how to sort a
large file that won't fit in memory.  I said I already had a fast and
efficient sort and that the sorting was already known.  My question was
about optimizing the process of rearranging a disk file according to a
*given* mapping.

One helpful person suggested reading sequentially and writing randomly,
rather than vice-versa, and I tried that but it didn't help.  I guess
the benefit gained from using the input stream buffering was cancelled
out by the effective loss of the output stream buffering.

The ever-attentive and always-reliable Mr Torek suggested looking
into disk cache algorithms in OS journals, and verily that I will do,
with thanks to those who responded.
-- 
Jeff Percival (jwp@larry.sal.wisc.edu)

jeffrey@algor2.UUCP (Jeffrey Kegler) (07/09/89)

In article <207@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu.UUCP (Jeffrey W Percival) writes:

=> Please be careful with your paraphrasing.

I certainly promise to try.

=>  My question was about optimizing the process of rearranging a disk file
=> according to a *given* mapping.

Jeffrey P. (no relation) had implemented a suggestion made in the AT&T
Bell Laboratories Technical Journal, by J. P. Linderman, p. 258.  He
extracted the keys and record locations of an unsorted file (call it
U), sorted them, and then constructed the sorted file (call it S),
only to find the random seeks involved in the last phase horrifyingly
slow.

=> One helpful person suggested reading sequentially and writing randomly,
=> rather than vice-versa,

That would have been my first thought.

=> and I tried that but it didn't help.  I guess
=> the benefit gained from using the input stream buffering was canceled
=> out by the effective loss of the output stream buffering.

Oh well.

As a second try, allocate enough memory for N full-length records, plus
two arrays to be sorted together of N keys and N record locations.  Go
through the sorted keys and find the keys and locations in U of the
first N records of the *sorted* file.  Sort those N entries by record
location in U, the unsorted file, then read them in, in order of
location in U, placing each record in memory at its sorted-by-key
position in the array of full-length records.  Then write those records
out.  Repeat until all records are written.  This involves one
sequential pass to write file S, and M/N sequential passes to read file
U (where M is the total number of records).
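
In C, that second try might look something like the following sketch
(N, the record length, and the shape of the key/location mapping are of
course placeholders):

    /* loc[i] is the record number in the unsorted file U of the record
     * that belongs at position i of the sorted file S -- the mapping
     * Jeffrey P. already has in hand.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define RECLEN 100             /* fixed record length -- a stand-in */
    #define N      4096            /* full records buffered per batch */

    struct slot { long u_recno; long batch_pos; };

    static int by_location(const void *a, const void *b)
    {
        long la = ((const struct slot *)a)->u_recno;
        long lb = ((const struct slot *)b)->u_recno;
        return (la > lb) - (la < lb);
    }

    void rearrange(FILE *u, FILE *s, const long *loc, long m)
    {
        char *buf = malloc((size_t)N * RECLEN);   /* N full-length records */
        struct slot *batch = malloc(N * sizeof *batch);
        long base, i, n;

        for (base = 0; base < m; base += N) {
            n = (m - base < N) ? m - base : N;

            /* the next n entries of the sorted mapping, tagged with their
             * position inside this batch, then ordered by location in U */
            for (i = 0; i < n; i++) {
                batch[i].u_recno = loc[base + i];
                batch[i].batch_pos = i;
            }
            qsort(batch, n, sizeof *batch, by_location);

            /* one forward sweep over U: reads are in ascending file order */
            for (i = 0; i < n; i++) {
                fseek(u, batch[i].u_recno * (long)RECLEN, SEEK_SET);
                fread(buf + batch[i].batch_pos * RECLEN, RECLEN, 1, u);
            }

            /* the batch is now in sorted-by-key order; write it out */
            fwrite(buf, RECLEN, (size_t)n, s);
        }
        free(buf);
        free(batch);
    }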

A further improvement is to estimate how many sequential record reads
cost the same as one random seek.  Call that ratio R.  Whenever reaching
the next wanted record would mean reading through more than R
intervening records (easily determined from the difference in record
locations), perform a seek instead; otherwise just read forward through
the gap.

My guess at R for UNIX is around 2.5 times the number of records per
block.  Obviously, the larger N is, the better this works.  Note that
your original try is this algorithm in the special case where N is 1.
If we could run this algorithm in terms of physical disk blocks instead
of logical file locations, it could really hum.
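
Continuing the sketch above, the forward sweep over U would then choose
between reading through the gap and seeking, something like this (R here
is only a guess to be tuned, as noted):

    #define R 25    /* roughly 2.5 * records per block -- tune per system */

    /* read one batch, already sorted by location in U, into buf */
    void fetch_batch(FILE *u, char *buf, const struct slot *batch, long n)
    {
        char skip[RECLEN];
        long cur = -1;    /* record number under the file pointer, -1 = unknown */
        long i, gap;

        for (i = 0; i < n; i++) {
            gap = batch[i].u_recno - cur;
            if (cur >= 0 && gap >= 0 && gap <= R) {
                /* cheap: read forward through the gap instead of seeking */
                while (cur < batch[i].u_recno) {
                    fread(skip, RECLEN, 1, u);
                    cur++;
                }
            } else {
                /* expensive: one real seek */
                fseek(u, batch[i].u_recno * (long)RECLEN, SEEK_SET);
                cur = batch[i].u_recno;
            }
            fread(buf + batch[i].batch_pos * RECLEN, RECLEN, 1, u);
            cur++;
        }
    }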

Further optimizations suggest themselves, but enough already.
-- 

Jeffrey Kegler, President, Algorists,
jeffrey@algor2.UU.NET or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090

kuno@gssm.otsuka.tsukuba.JUNET (Yasushi Kuno) (07/12/89)

jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:

>My question was about optimizing the process of rearranging a disk
>file according to a *given* mapping.

>One helpful person suggested reading sequentially and writing randomly,
>rather than vice-versa, and I tried that but it didn't help.  I guess
>the benefit gained from using the input stream buffering was cancelled
>out by the effective loss of the output stream buffering.

>The ever-attentive and always-reliable Mr Torek suggested looking
>into disk cache algorithms in OS journals, and verily that I will do,
>with thanks to those who responded.

Oh, that seems quite right, but how about this?  Open N files for
writing, read sequentially, and write each record to one of them so
that the N output files form ordered buckets... i.e. every record in
output file 1 sorts before every record in output file 2, and so on.
(You can do this because you already have the mapping in memory.)  Then
the original problem reduces to rearranging N smaller files.  Apply this
recursively until each file fits into your memory.  N and the number of
passes should depend on your system.  Isn't this simple?  I'm curious
whether it works well or not, so let me know if you try it.
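
A sketch of one distribution pass in C (dest[j] here is the final
position of input record j, i.e. the inverse of the mapping you already
hold; NBUCKETS, the record length, and the file names are placeholders):

    #include <stdio.h>

    #define RECLEN   100           /* fixed record length -- a stand-in */
    #define NBUCKETS 8             /* output files per pass */

    /* read the unsorted file sequentially and append each record to the
     * bucket covering its range of final positions; m = total records */
    void distribute(FILE *u, const long *dest, long m)
    {
        FILE *bucket[NBUCKETS];
        char rec[RECLEN], name[32];
        long per = (m + NBUCKETS - 1) / NBUCKETS;  /* positions per bucket */
        long j;
        int b;

        for (b = 0; b < NBUCKETS; b++) {
            sprintf(name, "bucket.%d", b);
            bucket[b] = fopen(name, "wb");
        }

        /* sequential read, buffered appends -- no random seeks at all */
        for (j = 0; fread(rec, RECLEN, 1, u) == 1; j++) {
            b = (int)(dest[j] / per);
            fwrite(rec, RECLEN, 1, bucket[b]);
        }

        for (b = 0; b < NBUCKETS; b++)
            fclose(bucket[b]);

        /* every record of bucket.0 belongs before every record of
         * bucket.1, and so on; each bucket can now be rearranged in
         * memory, or split again the same way if it is still too big */
    }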

							Yasushi Kuno

buck@siswat.UUCP (A. Lester Buck) (07/13/89)

In article <207@larry.sal.wisc.edu>, jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
> In article <8137@bsu-cs.bsu.edu> dhesi@bsu-cs.bsu.edu (Rahul Dhesi) writes:
> >In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W
> >Percival) writes:
> >[how do I sort a large file that won't fit in memory?]
> >There are many variations on the merge sort.  Here is a simple one:
> 
> Please be careful with your paraphrasing.  I did not ask how to sort a
> large file that won't fit in memory.  I said I already had a fast and
> efficient sort and that the sorting was already known.  My question was
> about optimizing the process of rearranging a disk file according to a
> *given* mapping.
> -- 
> Jeff Percival (jwp@larry.sal.wisc.edu)

I was one who emailed about basic external sorting.  I don't understand what
you are saying above.  If your problem is to sort a large file that is on
external storage, then the classical methods are what you want.  If you
somehow have a problem where you are handed the already sorted list of
pointers to records and you want to go from that stage, that is _maybe_ a
different problem - I don't know because that sounds extremely contrived.
It is pointed out in many places, including Knuth Vol. 3, that it gains you
next to *nothing* to get a sorted table of pointers to records when what you
almost always want is the records themselves sorted.  "Fast and efficient
sorts" that don't order the records themselves are not of much advantage.
You are just delaying the real pain of moving the records.

From Knuth Vol. 3, p. 374:

    The external rearrangement problem which remains after [key sorting]
    seems trivial, at first glance; but in fact it is rather difficult,
    and no really good algorithms (significantly better than sorting)
    have yet been found.  We could obviously do the rearrangement in
    N steps, moving one record at a time; for large enough N this is
    better than the N(log N) of a sorting method.  But N is never that
    large, and N seeks are unthinkable!
    [...]
    But it seems wasteful to do a key sort only to follow it by a
    full sort! 
    [Goes on to discuss a nontrivial lower bound on the number
    of seeks required to rearrange records on a disk file.]


The only real gain of sorting the keys and not the records themselves is
when you _don't_ have to move the data, as in an indexed file application.
You still have to physically sort the records once if you require reasonable
sequential access.

I remember reading once that an estimated 40% of all mainframe cycles in the
world were spent on sorting.  I am sure we would all be interested in
hearing about any special case you might have where your keysort is
completely free, but I doubt whether you will be able to do much better than
the "classical" sort/merge with replacement selection.  If you do end up
with a good method, you have a highly publishable (and patentable!) piece of
work.


-- 
A. Lester Buck		...!texbell!moray!siswat!buck