ETA: 4 weeks
WIP repository: https://github.com/SaveTheRbtz/eblob/tree/data-sort
Sort eblob data along with index itself.
Every time after we've finished writing eblob when hitting data size or number of entries cap we are sorting its' index, removing it from memory and on next access we can preform binary search on it.
Aim of this project is to sort eblob's data too, so we can then easily fetch ranges of keys.
We already have sorted index of all entries in eblob stored in. So we could construct sorted data file based only on this index.
This is a bad idea, since iterating over data using sorted index leads to enormous amount of disk seeks, which slows down whole system to the level, where it is merely unresponsive. Better check it on bit datafiles like 20-100-200 Gb - such sorting will never end, while merge sort works with half of the disk speed.
Let's implement merge sort then:
There are many implementations of mergesort including “Natural” and “Polyphase” variants. Most common are:
Both of them suffer from very slow initial phase which sorts data in power of two chunks starting from 1. This is not suited for us, so lets go hybrid: in-memory quicksort for small amounts of data and merge sort of resulting files.