A utility for sorting really big files.

Obviously, if the ordering of a flat file does not matter, then you should sort it. But oddly, the places that could most likely benefit from sorted data don't seem to bother. I ran into one of these when I started playing with the (now defunct) Freebase RDF triple dump. This is a decently sized pile of data: 3 billion facts (one per line), 30GB compressed, 425GB uncompressed. Even just casually browsing through the data, I saw a fair number of duplicate entries. I wanted to sort it before really digging in.
