Update this project:
tl;dr * [sources are at github](https://github.com/keenerd/gz-sort/) * `git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h` ## What does it do? Gz-sort sorts gzipped data files. Really really big gzipped data files that I couldn't figure out how to wrangle with [gnu-sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html). After heavy tweaking gnu-sort can do some very large files indeed, but with poor big-O disk patterns. It was bad enough that I physically could not sort a data file without buying a new hard drive. So instead I wrote a sort that runs in O(1) memory, O(n) disk and O(n log n) time. ## Minimum requirements to sort a terabyte: * 4MB ram (yes, megabyte) * free disk space equal to the twice the compressed source.gz ## Why it exists Sorted text files have a whole lot of utility: * remove (or count) duplicate lines * easy binary searches * compresses to half the size of shuffled data Obviously if the ordering of a flat file does not matter then you should sort it. But oddly the places that could most likely benefit from sorted data don't seem to bother. I ran into one of these when I started playing with the (now defunct) Freebase RDF Triple dump. This is a pretty decently sized pile of data. 3 billion facts (each on a line), 30GB compressed, 425GB uncompressed. Even just casually browsing though the data I saw a fair number of duplicate entries. I wanted to sort it before really digging in. I also had a problem: insufficient disk space to store even a single uncompressed copy of the data. This makes sorting difficult. Gnu-sort really likes uncompressed data. It wants to read uncompressed input, write uncompressed output and store uncompressed scratch data. Input and output are simple to handle; use pipes and gzip. The scratch space will first use all available ram and spill over into tmpfs if memory use gets too high. This is completely unworkable when your tmpfs is in ram and the uncompressed data is 25 times larger than your ram. There is an option to use disk instead of tmpfs, but that doesn't work either because it will put an entire uncompressed copy onto disk in the process. So it needs to be combined with an obscure option that tells gnu-sort to compress all the scratch data. Overall my best attempt ended up looking like this: gzip -dc source.gz | LANG=C sort -u -S 15G --parallel=4 -T ~/ --compress-program=gzip | gzip > dest.gz The `LANG=C` bit bypasses all the localization and slashes the run time. I also tell it to use almost all my RAM and all my cores. With 90GB free at the time this should have worked. In practice it crashed after 15 minutes, with an out-of-disk error. > ###"Old age and treachery always overcomes youth and skill." (Waylon Jennings) The gnu-coreutils are known for being [fast and sneaky](http://ridiculousfish.com/blog/posts/old-age-and-treachery.html). Gnu-sort is no exception. I've tried to write "a better sort" several times in the past, and it typically goes well until I remember about `LANG=C` at which point gnu-sort blows away whatever clever data structure I was working on. But occasionally youth and meticulous attention-to-detail pay off. In a rather wild four-day weekend, gz-sort was born. It took two days to write an efficient core and another two days to add/debug multithreading. Right now it works well enough for my purposes, but keep in mind how rough the code base is. How does it fare on the Freebase dump? * The input: 3 billion lines, 30GB compressed, 425GB uncompressed. * The rig: quad-core A8-7600, 16GB ram, 256GB 850 Pro SSD with 90GB available. * The algorithm: a simple merge sort, predicted to finish in 10.2 hours. (Actual time, 9.5 hours.) * The output: 25.2GB compressed (16% smaller) with 5 million duplicate lines removed. All that from a 14KB binary (650 lines of C) which had a median memory use of 10MB, running on a fairly average desktop. ## Doesn't this already exist? As far as I could tell, not really. Let's look at a Wikipedia page and two Stack Overflow posts to get an idea of "state of the art" among the average programmer. [External sorting](https://en.wikipedia.org/wiki/External_sorting) is exactly what we want! It's even got some external links, which are also exactly what we want. They are, in summary: * "STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks." Close, but my data doesn't even fit on disk. * mergesort example - "It is assumed that each line of the file contains a word and is no more than 31 characters in length" * K-Way Merge - Dead google code repo. * externalsortinginjava - Ehhh, java. * pennysort/judyarrays - Another dead google code repo. * sortbenchmark.org - Fascinating but they all seem to be for weird synthetic data sets. [Post one](http://stackoverflow.com/questions/2087469/sort-a-file-with-huge-volume-of-data-given-memory-constraint) is from 2010. * Use an external sort. Recommended five times. No suggestions to anything specific. * Use a database. Recommended four times. "10-15 millions of flat file lines" isn't very large. * Use cloud and big data tools. No thanks, I like doing things locally. [Post two](http://unix.stackexchange.com/questions/120096/how-to-sort-big-files) is from 2014 and deals with a 200GB input file. * Use gnu-sort. Two recommendations. "GNU sort is designed to cope well with files that are much larger than RAM." But it can't really do files larger than disk. So there doesn't seem to be anything practical out there. If I am wrong, please send me a link. I'd love to have something to benchmark against. ## Vs Gnu-Sort Since I physically can't compare the two utilities with the Freebase data, I extracted some sample values for a toy demo instead. The sample was 125 million lines long, 2GB when uncompressed and 600MB compressed. During tests memory will be capped at 200MB, to simulate working with a file 10x larger than ram. Four cores and as much disk as they want may be used. > time gzip -dc source.gz | LANG=C sort -S 200M --parallel=4 -T ~/ --compress-program=gzip | gzip > dest.gz 4.8 minutes > time gz-sort -S 200M -P 4 source.gz dest.gz 5.4 minutes Even after severely hamstringing gnu-sort, it still manages to be 11% faster than gz-sort. (Normally gnu-sort would crush this task in 30 seconds.) After sorting the file was 300MB. ## Wait that doesn't seem impressive! You mean to say that after all that, gnu-sort is still faster? Sadly yes. Gz-sort did use slightly less scratch space (410MB vs 500MB). But I feel that I am missing something. At that level of disk usage, gnu-sort should have been able to process the Freebase data. Gnu-sort appears to have yet another level of "gotcha" built in, beyond the default tmpfs and uncompressed scratch gotchas, that crops up at these scales. Gz-sort wins in the "does not do anything surprising" category, scales perfectly over truly monstrous files, and actually manages to finish the big job. It should also be noted that the memory use patterns are dramatically different. Gnu-sort hogs as much memory as it can take for the entire run. Gz-sort is similar but releases all that memory back to the OS once the first pass is completed. So while sorting the 425GB dataset, it used all the ram for the first hour and then dropped back to 10MB for the other eight. Additionally there is zero IPC between the threads, further saving resources. The disk access patterns are all linear reads and writes, with an average IO of 25MB/sec. (The SSD had a random-write performance of 300MB/sec.) After that initial ram-heavy presort, if gz-sort was nice-ed you wouldn't even notice it processing hundreds of gigabytes of data. Gz-sort scales smoothly and predictably and it is very good at not hogging the computer. It also provides a quick answer to the interviewer who asks "How do you sort 200GB of data using only an RPi with a 64GB SD card?" ## How to use it First, [get the sources](https://github.com/keenerd/gz-sort/) and build it with make. It requires the zlib headers and a system with various gnu extensions. gz-sort [-u] [-S n] [-P n] source.gz destination.gz * `-h` produces the standard help text. * `-u` uniques the output, same as gnu-sort. * `-S n` sets the buffer size. It allows k/M/G suffixes, same as gnu-sort. By default it is set to an extremely conservative 1MB. Crank this way up! * `-P n` is analogous to `--parallel` in gnu-sort and sets the thread count. ## Estimating run time I keep saying that gz-sort is predictable, so let's make some predictions. About how long would `gz-sort -S 12GB -P 4 source.gz` take? First, get a baseline of what you are working with. Approximately half of gz-sort's cpu cycles are spent on zlib, so it makes a good benchmark: time gzip -dc source.gz | gzip > /dev/null In the Freebase example that took 114 minutes on my desktop. Also measure the uncompressed size (425GB in this case), the number of threads (4) and the maximum ram use (12GB here). I would have used the full 16GB, but I wanted a bit of margin in case malloc did something silly or I had a small leak. (It didn't, I could have used 15GB safely.) Next, find the unthreaded run time total_time = zlib_time * entropy * (log2(uncompressed_size/buffer_size) + 2) Where entropy is a fudge factor between 1.5 and 3 for how unsorted the source data is. The Freebase dataset appeared to be well-grouped locally, so 1.5 seems appropriate. Plugging everything in gives 20.4 hours unthreaded. Threading appears to scale with the square root of the number of threads and using 4 threads cuts the run time in half, to 10.2 hours. The measured run time was 9.5 hours, 6.8% off of the estimate. I could very well be wrong about the `sqrt(threads)` thing. Someone with a 64 core beast will have to weigh in. I suspect that the final n-way merge will begin to dominate and drag the time up. ## Known bugs to fix Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list too. * Does not build on non-gnu systems. * Sqrt(threads) is a terrible ratio. * No support for uncompressed stdin streams. * Breaks if a line is longer than the buffer size. * Lacks all error handling. * Ugly code with lots of ways to refactor. * Output could use predictable flushes. ## Performance tweaks to try * Profile! * Parallelize the final n-way merge. This will require adding IPC. * Filter unique lines during the earlier passes. * Try out zlib-ng, about half of cpu time is spent on (un)gzipping. * Improve memory estimation, it lowballs and that hurts the presort. * Byte-based seeking instead of line-counting.