blueprints:eblob:data-sort

**Assignance**: rbtz@

**Status**: TESTING

**DependsOn**: binlog

**ETA**: 4 weeks

**WIP repository**: https://github.com/SaveTheRbtz/eblob/tree/data-sort

Sort eblob data along with index itself.

Every time after we've finished writing eblob when hitting data size or number of entries cap we are sorting its' index, removing it from memory and on next access we can preform binary search on it.

Aim of this project is to sort eblob's data too, so we can then easily fetch ranges of keys.

We already have sorted index of all entries in eblob stored in. So we could construct sorted data file based only on this index.

This is a bad idea, since iterating over data using sorted index leads to enormous amount of disk seeks, which slows down whole system to the level, where it is merely unresponsive. Better check it on bit datafiles like 20-100-200 Gb - such sorting will never end, while merge sort works with half of the disk speed.

Let's implement merge sort then:

There are many implementations of mergesort including “Natural” and “Polyphase” variants. Most common are:

- D. Knuth, Algorithm 5.2.4L, TAOCP Volume 3, 2nd Edition. (holy mess of goto's by the way)
- “Textbook” recursive.

Both of them suffer from very slow initial phase which sorts data in power of two chunks starting from 1. This is not suited for us, so lets go hybrid: in-memory quicksort for small amounts of data and merge sort of resulting files.

- Start binlog
- Split into chunks using iterator
- Sort chunks
- Merge sorted chunks
- Lock base
- Apply binlog
- Swap original base with sorted one
- Flush caches
- Unlock base

- Get familiar with eblob code base (1-2 days)
- Implement binary log for eblob (1 week)
- Implement data-sort (3 weeks)
- Write tests for new code (2 days)
- Fix bugs found by tests (5 days)
- Deploy to testing (???)
- Fix bugs found in testing, update tests and re-deploy (???)

blueprints/eblob/data-sort.txt · Last modified: 2013/02/04 00:38 by savetherbtz