Saturday, February 28, 2015

External sort


If you have a 2 GB file with one string per line, which sorting algorithm would you use to sort the file and why? 

For large files that cannot be fit into the memory, we need to do external sort. Assume  the memory is 1 GB. 

  1. Divide the file into K chunks, where X * K = 2GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm, e.g., quick sort. 
  2. Write the sorted data to disk. 
  3. Repeat the above steps until all K chunks are sorted. 
  4. Read the first N portion of each chunk (so that X * N < 1GB) into input buffers in main memory and allocate the remaining memory for an output buffer. 
  5. Perform a k-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the X input buffers empties, fill it with the next N portion of the chunk until no more data from the chunk is available. 
Additional passes, e.g., if we have 500 chunks, after sorting each chunk, we can run a first merge pass combining 25 chunks at a time, resulting in 20 large sorted chunks. Run a second merge pass to merge the 20 larger sorted chunks. 

Efficient external sorts require O(n log n) where n is the total elements to be sorted. Using parallelism may improve performance. 

No comments:

Post a Comment