Eblob database consists of many so-called bases. Each base consists of number of files, generally those are: data-file, index file, sorted index file and sorted index marker:
Here is an example of my tiny eblob database:
-rw-r--r-- 1 rbtz staff 358 May 31 16:34 test-blob-0.0 -rw-r--r-- 1 rbtz staff 0 May 31 16:34 test-blob-0.0.data_is_sorted -rw-r--r-- 2 rbtz staff 192 May 31 16:34 test-blob-0.0.index -rw-r--r-- 2 rbtz staff 192 May 31 16:34 test-blob-0.0.index.sorted -rw-r--r-- 1 rbtz staff 529 May 31 16:34 test-blob-0.1 -rw-r--r-- 1 rbtz staff 288 May 31 16:34 test-blob-0.1.index -rw-r--r-- 1 rbtz staff 288 May 31 16:34 test-blob-0.1.index.sorted -rw-r--r-- 1 rbtz staff 1571 May 31 16:34 test-blob-0.2 -rw-r--r-- 1 rbtz staff 768 May 31 16:34 test-blob-0.2.index -rw-r--r-- 1 rbtz staff 532 May 31 16:34 test-blob-0.3 -rw-r--r-- 1 rbtz staff 192 May 31 16:34 test-blob-0.3.index -rw-r--r-- 1 rbtz staff 100 May 31 16:34 test-blob.stat -rw-r--r-- 1 rbtz staff 0 May 31 16:34 test-blob.lock
In this example we see set of file types for each base:
test-blob-0.0is “data file” - all data goes here (prepended by small header which is equal to index file record).
data-sortmarker which states that records in “data-file” are sorted by key.
test-blob-0.0.index- is unsorted index that consists of “index records”. Used for fast lookups.
test-blob-0.0.index.sorted- is index sorted by key for even faster lookups.
Here we also see 4 bases:
0,1,2,3. Three of them (0,1,2) are so-called “closed” – no new writes go there (with overwrite as a little exceptions that I will mention later). Base
0 is sorted – this is indicated by
*.data_is_sorted marker. Base
1 has only sorted index, but data is in write-order – this is because
data-sort process did not run since this base was closed. Base
2 don't even have sorted index - it's index is still in memory in red-black tree and it will be sorted either after
data-sort or on next eblob init. Base
3 is “opened” – all new writes are appended to it.
.index.sorted files we can save memory via purging all entries from red-black tree and using sorted index preform binary search (it's more sophisticated than plain
bsearch(3), it'll be explained later) on part of index that contains that key.
Also, you can read more about data-sort process and it's pros and cons here.
There are also two supplementary files:
.lock. Former is used as text-based statistics representation for opened database. Latter one is used for
lockf(2)-based database locking to prevent corruption on simultaneous opens.
-0. in that listing is type number - it's column in eblob terms, but don't mind it - it'll was removed somewhere around
0.20.0 because it adds unnecessary complexity to eblob code while don't add any value.
Now that we have some knowledge of eblob internals I can explain how write path is structured.
Generally when we pass data to
eblob_write() call it goes through following steps:
blob_size) and records number(
records_in_blob). Those limits can be adjusted in eblob_config.
keyin in-memory index and all on-disk indexes;
hashin eblob terms) and in unsorted index (
.index) to disk.
By default eblob tries to overwrite data in-place, so it's not just not blindly removed but checked if it can be overwritten “in-place” without corrupting following record. So eblob is append-only storage which can overwrite old data in-place for performance reasons if requested. Here I should mention that all records in base have two associated sizes in their header (“index record”):
disk_size. Disk size is basically how much space there is between this record and next one, Data size is how much of read data is returned to user and so it's size that is directly visible to user. Disk size is always greater or equal than
data_size. If there is not enough
disk_size to write new record it's just appended to “opened” blob like always instead of overwriting in place.
In reality write path is more complicated than that because in presence of data-sort for blobs that are in process of sorting right now we always preform read-copy-update cycle because we can't overwrite entries in-place. Also all remove operations are copied to in-memory log. For it's in-depth description see data-sort.
BLOB_DISK_CTL_APPEND tries to append data to the end of an existing record. But you can recall that in data-file on disk right after appended record there is another one, so eblob can't blindly append data to it. Instead it checks that there is enough space in difference between
data_size to accommodate size of appended data and if there is not then it copies data from it's original position to the end of “opened” base and multiply
disk_size by two, so it will be more possible for following appends to succeed in overwriting data in-place.
Append writes are behaving like atomic offset writes with
There are much more nuances about writes in blob esp. when using multi-stage
prepare/plain_write/commit protocol but for basic understanding this is pretty much all.
Read path is not so complicated as write path, possibly because of absence of multi-stage protocol and different read flags.
When read request for some
key arrives to libeblob via
eblob_read() it goes through following steps:
keyin our in-memory cache (red-black tree known as
hashinside eblob). This tree stores
offsets for index and data, so we can return data right away.
keyis not found in in-memory cache we should examine EACH bases' indexes. This is rather heavy-weight process so there are lots of optimizations involved:
bsearchis executed only for small set of index entries (by default 50, can be tuned via config
As you can see read path of eblob is heavily tuned esp. for workloads where indexes do not fit into RAM.