Reverbrain wiki

Site Tools


blueprints:eblob:data-sort

Info

Assignance: rbtz@

Status: TESTING

DependsOn: binlog

ETA: 4 weeks

WIP repository: https://github.com/SaveTheRbtz/eblob/tree/data-sort

Summary

Sort eblob data along with index itself.

Whiteboard

Every time after we've finished writing eblob when hitting data size or number of entries cap we are sorting its' index, removing it from memory and on next access we can preform binary search on it.

Aim of this project is to sort eblob's data too, so we can then easily fetch ranges of keys.

Technical information

We already have sorted index of all entries in eblob stored in. So we could construct sorted data file based only on this index.

This is a bad idea, since iterating over data using sorted index leads to enormous amount of disk seeks, which slows down whole system to the level, where it is merely unresponsive. Better check it on bit datafiles like 20-100-200 Gb - such sorting will never end, while merge sort works with half of the disk speed.

Let's implement merge sort then:

There are many implementations of mergesort including “Natural” and “Polyphase” variants. Most common are:

  1. D. Knuth, Algorithm 5.2.4L, TAOCP Volume 3, 2nd Edition. (holy mess of goto's by the way)
  2. “Textbook” recursive.

Both of them suffer from very slow initial phase which sorts data in power of two chunks starting from 1. This is not suited for us, so lets go hybrid: in-memory quicksort for small amounts of data and merge sort of resulting files.

  1. Start binlog
  2. Split into chunks using iterator
  3. Sort chunks
  4. Merge sorted chunks
  5. Lock base
  6. Apply binlog
  7. Swap original base with sorted one
  8. Flush caches
  9. Unlock base

Work Items

  • Get familiar with eblob code base (1-2 days)
  • Implement binary log for eblob (1 week)
  • Implement data-sort (3 weeks)
  • Write tests for new code (2 days)
  • Fix bugs found by tests (5 days)
  • Deploy to testing (???)
  • Fix bugs found in testing, update tests and re-deploy (???)
blueprints/eblob/data-sort.txt · Last modified: 2013/02/04 00:38 by savetherbtz