Reverbrain wiki

Background

An eblob database consists of many so-called bases. Each base consists of a number of files, generally: a data file, an index file, a sorted index file, and a sorted-index marker.

Here is an example of my tiny eblob database:

-rw-r--r--   1 rbtz  staff      358 May 31 16:34 test-blob-0.0
-rw-r--r--   1 rbtz  staff        0 May 31 16:34 test-blob-0.0.data_is_sorted
-rw-r--r--   2 rbtz  staff      192 May 31 16:34 test-blob-0.0.index
-rw-r--r--   2 rbtz  staff      192 May 31 16:34 test-blob-0.0.index.sorted
-rw-r--r--   1 rbtz  staff      529 May 31 16:34 test-blob-0.1
-rw-r--r--   1 rbtz  staff      288 May 31 16:34 test-blob-0.1.index
-rw-r--r--   1 rbtz  staff      288 May 31 16:34 test-blob-0.1.index.sorted
-rw-r--r--   1 rbtz  staff     1571 May 31 16:34 test-blob-0.2
-rw-r--r--   1 rbtz  staff      768 May 31 16:34 test-blob-0.2.index
-rw-r--r--   1 rbtz  staff      532 May 31 16:34 test-blob-0.3
-rw-r--r--   1 rbtz  staff      192 May 31 16:34 test-blob-0.3.index
-rw-r--r--   1 rbtz  staff      100 May 31 16:34 test-blob.stat
-rw-r--r--   1 rbtz  staff        0 May 31 16:34 test-blob.lock

In this example we see the set of file types that make up each base:

  • test-blob-0.0 is the “data file” - all data goes here (each record is prepended by a small header equal to the index-file record).
  • test-blob-0.0.data_is_sorted is the data-sort marker, which states that records in the data file are sorted by key.
  • test-blob-0.0.index is the unsorted index, which consists of “index records”. Used for fast lookups.
  • test-blob-0.0.index.sorted is the index sorted by key, for even faster lookups.
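The header mentioned above (the “index record”) can be sketched as a C struct like this. Field names follow eblob's eblob_disk_control, but treat the exact layout as an approximation rather than the authoritative on-disk format:

```c
#include <stdint.h>

/* Sketch of eblob's on-disk index record, based on the description
 * above. Names follow eblob's eblob_disk_control; the exact layout is
 * an approximation. */
#define EBLOB_ID_SIZE 64

struct eblob_key {
    unsigned char id[EBLOB_ID_SIZE];
};

struct eblob_disk_control {
    struct eblob_key key;   /* record key */
    uint64_t flags;         /* record flags, e.g. the "removed" bit */
    uint64_t data_size;     /* bytes of payload visible to the user */
    uint64_t disk_size;     /* bytes reserved on disk, >= data_size */
    uint64_t position;      /* offset of the record in the data file */
} __attribute__((packed));
```

Note that such a record is 96 bytes, which is consistent with the listing above: the 192-byte index files hold exactly two records each.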

Here we also see 4 bases: 0, 1, 2, 3. Three of them (0, 1, 2) are so-called “closed” – no new writes go there (with overwrite as a small exception that I will mention later). Base 0 is sorted – this is indicated by the *.data_is_sorted marker. Base 1 has only a sorted index, but its data is in write order – the data-sort process has not run since this base was closed. Base 2 does not even have a sorted index – its index is still in memory in a red-black tree, and it will be sorted either after data-sort or on the next eblob init. Base 3 is “opened” – all new writes are appended to it.

By having .index.sorted files we can save memory: we purge all entries from the red-black tree and instead perform binary search over the sorted index (it's more sophisticated than plain bsearch(3); it'll be explained later) on the part of the index that contains the key. Also, you can read more about the data-sort process and its pros and cons here.

There are also two supplementary files: .stat and .lock. The former holds a text-based statistics representation of the opened database. The latter is used for lockf(2)-based database locking to prevent corruption on simultaneous opens.

BTW, -0. in that listing is the type number – a “column” in eblob terms. Don't mind it – it was removed somewhere around 0.20.0 because it added unnecessary complexity to the eblob code while adding no value.

Write path

Now that we have some knowledge of eblob internals, I can explain how the write path is structured.

Generally, when we pass data to libeblob via an eblob_write() call, it goes through the following steps:

  • Get the “opened” base.
  • Check that it is not too big in terms of size (blob_size) and number of records (records_in_blob). Those limits can be adjusted in eblob_config.
  • If the base is over its limits, it is “closed”, and a new base is created and marked as “opened”.
  • Either way, the record is searched by key in the in-memory index and all on-disk indexes;
    • If it exists, it is marked as removed in the “index file” of its base.
  • Then the new data is appended to the “data file” of the “opened” base.
  • A new index entry is constructed and placed both in the in-memory cache (a red-black tree called hash in eblob terms) and in the unsorted on-disk index (.index).
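The base-rotation part of these steps (get the opened base, check its limits, close it and open a fresh one when it is over them) can be modelled with a toy in-memory sketch. Everything here – the struct, the limits, the function names – is hypothetical and only illustrates the control flow, not the real eblob API:

```c
#include <stdint.h>

/* Toy model of the write path's base rotation. Hypothetical names and
 * limits; real limits come from blob_size / records_in_blob in
 * eblob_config. */
enum { BLOB_SIZE_LIMIT = 1024, RECORDS_IN_BLOB = 4 };

struct base {
    uint64_t used;   /* bytes written to this base's data file */
    int records;     /* number of records in this base */
    int closed;      /* no new appends once set */
};

/* Return the opened base; if it is over its limits, close it and move
 * on to the next (freshly "created") base. */
struct base *get_opened_base(struct base *bases, int *opened) {
    struct base *b = &bases[*opened];
    if (b->used >= BLOB_SIZE_LIMIT || b->records >= RECORDS_IN_BLOB) {
        b->closed = 1;     /* base is over limits: close it */
        (*opened)++;       /* next base becomes the opened one */
    }
    return &bases[*opened];
}

/* Append a record to the opened base (index maintenance omitted). */
void append_record(struct base *b, uint64_t size) {
    b->used += size;
    b->records++;
}
```

In the real code the index-maintenance steps (marking the old record removed, updating the red-black tree and the .index file) happen alongside this, but the open/close decision is the part worth seeing in isolation.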

By default eblob tries to overwrite data in place, so an old record is not just blindly removed – eblob first checks whether it can be overwritten “in place” without corrupting the following record. So eblob is an append-only storage which can overwrite old data in place for performance reasons if requested. Here I should mention that every record in a base has two associated sizes in its header (“index record”): data_size and disk_size. Disk size is basically how much space there is between this record and the next one; data size is how much of the read data is returned to the user, so it's the size directly visible to the user. Disk size is always greater than or equal to data_size. If disk_size is not big enough for the new record, the record is simply appended to the “opened” blob as usual instead of being overwritten in place.
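The in-place overwrite decision boils down to one comparison: the new record (header plus payload) must fit into the space already reserved on disk for the old one. A minimal sketch of that rule, assuming the record header travels with the data as described above:

```c
#include <stdint.h>

/* Sketch of the in-place overwrite check described above: the new
 * header + payload must fit into the old record's reserved disk_size,
 * otherwise the write falls back to appending to the opened base.
 * This is an illustration, not eblob's actual code. */
int can_overwrite_in_place(uint64_t old_disk_size,
                           uint64_t new_data_size,
                           uint64_t header_size) {
    return header_size + new_data_size <= old_disk_size;
}
```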

In reality the write path is more complicated than that: in the presence of data-sort, for blobs that are being sorted right now we always perform a read-copy-update cycle, because we can't overwrite entries in place. Also, all remove operations are copied to an in-memory log. For an in-depth description see data-sort.

The BLOB_DISK_CTL_APPEND flag tries to append data to the end of an existing record. But recall that in the on-disk data file there is another record right after the appended one, so eblob can't blindly append data to it. Instead it checks whether the difference between disk_size and data_size is enough to accommodate the appended data; if it is not, it copies the record from its original position to the end of the “opened” base and multiplies disk_size by two, so that subsequent appends are more likely to succeed in overwriting data in place. Append writes behave like atomic offset writes with offset == data_size.
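The append logic above can be sketched as follows. The struct and function are hypothetical; in particular, the sketch only flags that the record "moved" instead of actually copying bytes to the end of the opened base:

```c
#include <stdint.h>

/* Toy model of BLOB_DISK_CTL_APPEND: either the appended bytes fit
 * into the slack between data_size and disk_size, or the record is
 * copied to the end of the opened base with disk_size doubled so the
 * next append is more likely to fit in place. Hypothetical names,
 * not the real implementation. */
struct record {
    uint64_t data_size;  /* user-visible size */
    uint64_t disk_size;  /* reserved size, >= data_size */
    int moved;           /* set when the record was copied to base end */
};

void append(struct record *r, uint64_t add) {
    if (r->data_size + add > r->disk_size) {
        /* not enough slack: copy record to end of opened base
         * (elided here) and double the reservation */
        r->moved = 1;
        r->disk_size *= 2;
    }
    /* append behaves like an offset write with offset == data_size */
    r->data_size += add;
}
```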

There are many more nuances about writes in the blob, especially when using the multi-stage prepare/plain_write/commit protocol, but for a basic understanding this is pretty much all.

Read path

The read path is not as complicated as the write path, possibly because of the absence of a multi-stage protocol and of different read flags.

When a read request for some key arrives to libeblob via eblob_read(), it goes through the following steps:

  • First we search for the key in our in-memory cache (the red-black tree known as hash inside eblob). This tree stores fds and offsets for index and data, so we can return the data right away.
  • If the key is not found in the in-memory cache, we must examine EACH base's indexes. This is a rather heavyweight process, so there are lots of optimizations involved:
    • We process bases in reverse order, because data is more likely to be in recent bases. This holds judging by both user behavior and eblob's overwrite behavior.
    • Each base has a bloom filter attached to it, so we filter out most (around 99.9%) of the bases with it.
    • Each base has an index block tree. Because index blocks are constructed only for sorted indexes, this tree consists of start-stop key ranges. If the key does not belong to any start-stop range, the base is skipped.
    • If the key belongs to a range, bsearch is executed only over a small set of index entries (50 by default; can be tuned via the index_block_size config option).
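The per-base filtering above can be sketched as one lookup function: a bloom-filter check, then a start-stop range check, then binary search. Everything here is a deliberately tiny model (64-bit keys, a single 64-bit word standing in for the bloom filter, no index blocks); all names are hypothetical:

```c
#include <stdint.h>

/* Toy model of the per-base lookup filters described above. Real eblob
 * uses a proper bloom filter and an index block tree over 64-byte keys;
 * this sketch shrinks both to fit in a few lines. */
struct sorted_base {
    const uint64_t *keys;  /* sorted index keys */
    int nkeys;
    uint64_t bloom;        /* tiny 64-bit "bloom filter" stand-in */
};

static uint64_t bloom_bit(uint64_t key) { return 1ULL << (key % 64); }

/* Returns the index of key in the base, or -1 when the base is
 * filtered out or the key is not found. */
int base_lookup(const struct sorted_base *b, uint64_t key) {
    if (!(b->bloom & bloom_bit(key)))
        return -1;                        /* bloom filter says "absent" */
    if (b->nkeys == 0 || key < b->keys[0] || key > b->keys[b->nkeys - 1])
        return -1;                        /* outside start-stop range */
    int lo = 0, hi = b->nkeys - 1;        /* binary search over the rest */
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (b->keys[mid] == key)
            return mid;
        if (b->keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

In the real read path this runs once per base, in reverse base order, until some base returns a hit.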

As you can see, the read path of eblob is heavily tuned, especially for workloads where indexes do not fit into RAM.

eblob/internals/reads_and_writes.txt · Last modified: 2013/09/04 21:02 by savetherbtz