
Cache

The cache workflow was substantially reworked in elliptics 2.24.13.12.

Introduction

Cache is a layer between the backend and user requests whose main goal is to reduce the number of disk I/O operations. All read/write/remove operations go through the cache even if the NOCACHE flag is set. NOCACHE means that the cache will not be populated from disk if the data is not already cached; if the key is already in the cache, it will still be updated to maintain cache coherency. It is usually bad design to mix cache and no-cache operations for the same keys in client code.

If data is written to the cache with DNET_IO_FLAGS_CACHE, it will be synced to disk after cache_sync_timeout, specified in the Elliptics configuration (30 seconds by default). Note that if the server receives a remove with the CACHE_ONLY flag, all not-yet-synced data will be erased, so make sure not to mix CACHE-only and CACHE+CACHE_ONLY flags for the same keys, as it may lead to data never reaching disk.
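As a concrete illustration, here is a minimal client-side sketch. It assumes the C++ bindings of this elliptics version (ioremap::elliptics::session with its set_ioflags/write_data calls); treat the exact header and signature details as assumptions rather than a verbatim API reference.

  // Minimal sketch, assuming the elliptics C++ bindings; header and
  // signature details may differ between versions.
  #include <elliptics/session.hpp>

  using namespace ioremap::elliptics;

  void write_through_cache(session &s, const std::string &key, const data_pointer &data) {
      // CACHE: the write lands in the cache and is flushed to disk
      // after cache_sync_timeout (30 seconds by default).
      s.set_ioflags(DNET_IO_FLAGS_CACHE);
      s.write_data(key, data, 0).wait();
  }

  void write_cache_only(session &s, const std::string &key, const data_pointer &data) {
      // CACHE | CACHE_ONLY: the entry is never queued for sync, so it
      // lives only in RAM until it is evicted or explicitly removed.
      s.set_ioflags(DNET_IO_FLAGS_CACHE | DNET_IO_FLAGS_CACHE_ONLY);
      s.write_data(key, data, 0).wait();
  }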

Design

The cache layer consists of caches_number independent caches. Requests are distributed uniformly among the caches to decrease latency caused by lock contention.
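The mapping from a key to one of the caches is not specified here; the following is a hypothetical sketch of such routing (the real elliptics code may hash differently):

  #include <cstddef>
  #include <cstdint>

  // Hypothetical routing: derive a cache index from the key id so that
  // keys spread roughly uniformly over caches_number caches, each of
  // which has its own lock.
  size_t cache_index(const uint8_t *id, size_t id_size, size_t caches_number) {
      size_t h = 0;
      for (size_t i = 0; i < id_size; ++i)
          h = h * 131 + id[i];
      return h % caches_number;
  }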

Each cache is implemented as a multi-page LRU, a simple yet powerful and flexible generalization of Segmented LRU. The cache consists of cache pages, which are lightweight LRU lists. Each cache page has a constant 'temperature' that characterizes the popularity of the records in that page, and pages are ordered by temperature. When data is added to the cache for the first time, it gets into the top of the last (“coldest”) page. Each time it is referenced again, it moves to the top of the next (“hotter”) page; if it is already in the hottest page, it stays there. A similar scheme is used for eviction: when a cache page overflows, the record at the bottom of its LRU list moves to the neighboring (“colder”) page, and a record evicted from the coldest page is removed from the cache entirely.

The benefit of this strategy follows from simple logic: data that is accessed more frequently ends up in the hottest cache page and is therefore less likely to be evicted. The actual gain depends directly on the access pattern, which can be very specific to each application.
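The following is a simplified, self-contained sketch of such a multi-page LRU; all names are illustrative and none of this is the actual elliptics code.

  #include <cstdint>
  #include <list>
  #include <unordered_map>
  #include <utility>
  #include <vector>

  // Page 0 is the hottest, the last page is the coldest.
  class multipage_lru {
  public:
      explicit multipage_lru(std::vector<size_t> page_capacities)
          : m_pages(page_capacities.size()), m_capacity(std::move(page_capacities)) {}

      // A new key always enters the top of the coldest page.
      // Assumes the key is not already present.
      void insert(uint64_t key) {
          size_t coldest = m_pages.size() - 1;
          m_pages[coldest].push_front(key);
          m_index[key] = {coldest, m_pages[coldest].begin()};
          shrink(coldest);
      }

      // A referenced key is promoted to the top of the next hotter page
      // (or stays in the hottest page if it is already there).
      void touch(uint64_t key) {
          auto it = m_index.find(key);
          if (it == m_index.end())
              return;
          size_t hotter = it->second.page > 0 ? it->second.page - 1 : 0;
          m_pages[it->second.page].erase(it->second.pos);
          m_pages[hotter].push_front(key);
          it->second = {hotter, m_pages[hotter].begin()};
          shrink(hotter);
      }

  private:
      struct location {
          size_t page;
          std::list<uint64_t>::iterator pos;
      };

      // On overflow the record at the bottom of the page's LRU list is
      // demoted to the colder page; overflow of the coldest page evicts.
      void shrink(size_t page) {
          while (m_pages[page].size() > m_capacity[page]) {
              uint64_t victim = m_pages[page].back();
              m_pages[page].pop_back();
              if (page + 1 < m_pages.size()) {
                  m_pages[page + 1].push_front(victim);
                  m_index[victim] = {page + 1, m_pages[page + 1].begin()};
                  shrink(page + 1);   // demotion may overflow the colder page
              } else {
                  m_index.erase(victim);
              }
          }
      }

      std::vector<std::list<uint64_t>> m_pages;
      std::vector<size_t> m_capacity;
      std::unordered_map<uint64_t, location> m_index;
  };

In elliptics the page capacities would come from cache_pages_proportions (see Configuration below).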

Cache behavior

Elliptics works with the cache when the DNET_IO_FLAGS_CACHE or DNET_IO_FLAGS_CACHE_ONLY flag is set.

Even if neither of these flags is set, READ, WRITE, REMOVE and LOOKUP operations for a given ID still go through the cache if the key is present there; otherwise the commands go straight to the disk backend.

Here are examples of various operations with the DNET_IO_FLAGS_CACHE bit set.

Read operation:

  1. The entry in the cache is the result of previous appends: the appends are applied to the on-disk data, the whole file is copied into the cache and then returned.
  2. The entry is already in the cache: it is simply returned.
  3. The entry is not present in the cache and the CACHE_ONLY flag is not set (but the CACHE flag is): the cache is populated from disk by the entry's id and the value from the cache is returned.
  4. Otherwise a not-found error is returned.
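Put together, the read path can be summarized by the following hedged sketch; the types and helpers (cache_entry, populate_from_disk and so on) are illustrative, not the actual elliptics internals.

  // Hedged sketch of the read dispatch described above.
  read_result cache_read(const dnet_id &id, uint32_t ioflags) {
      cache_entry *e = find(id);
      if (e && e->only_appends)
          return merge_appends_with_disk(id, e); // 1. apply appends to disk data,
                                                 //    re-cache the whole file
      if (e)
          return e->data;                        // 2. plain cache hit
      if (!(ioflags & DNET_IO_FLAGS_CACHE_ONLY)) {
          e = populate_from_disk(id);            // 3. read-through on miss
          if (e)
              return e->data;
      }
      return not_found();                        // 4. nothing anywhere
  }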

Write operation:

  1. The entry is already in the cache: it is simply replaced. If the CACHE_ONLY flag is not set, the entry is added to the sync queue and will be flushed to disk after cache_sync_timeout (30 seconds by default).
  2. The entry is not present in the cache or is the result of previous appends, CACHE_ONLY is not set and the APPEND flag is set: only the data to be appended is stored in the cache, and the entry is added to the sync queue. Special internal flags force the flushing process to append the given data rather than overwrite the existing object.
  3. The entry is not present in the cache: it is added to the cache. If the CACHE_ONLY flag is not set, the entry is added to the sync queue.
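The write path follows the same pattern; again, this is only a hedged sketch with illustrative helper names.

  // Hedged sketch of the write dispatch described above.
  void cache_write(const dnet_id &id, uint32_t ioflags, const raw_data &d) {
      cache_entry *e = find(id);
      if (ioflags & DNET_IO_FLAGS_APPEND) {
          // 2. store only the appended tail; internal flags later tell the
          //    flusher to append rather than overwrite on disk
          e = append_to_cache(e, id, d);
      } else if (e) {
          e->data = d;                           // 1. replace in place
      } else {
          e = insert_into_cache(id, d);          // 3. fresh entry
      }
      if (!(ioflags & DNET_IO_FLAGS_CACHE_ONLY))
          add_to_sync_queue(e);                  // flush after cache_sync_timeout
  }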

Remove operation:

  1. The CACHE_ONLY flag is set: the entry is removed from the cache only.
  2. The CACHE_ONLY flag is not set: the entry is removed from both the cache and the disk.

Lookup operation:

  1. If the entry is in the cache, its “timestamp” is taken from the cache.
  2. Otherwise the “timestamp” is taken from disk.
  3. The size, offset and blob path are always taken from disk.
  4. If the key is present in the cache but there is no data on disk, the size is set to 0 (i.e. you cannot read data from disk by this key), while the timestamp is still taken from the cache.

Note that if the entry has not been synced to disk yet, lookup will return the “old” data position, i.e. lookup returns information about the current on-disk location of the given object.
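A hedged sketch of this merging logic (helper names are illustrative):

  // Lookup: the timestamp may come from the cache, while size, offset
  // and blob path always come from disk.
  lookup_info cache_lookup(const dnet_id &id) {
      lookup_info info = disk_lookup(id);        // size, offset, blob path
      if (cache_entry *e = find(id)) {
          info.timestamp = e->timestamp;         // cache wins on timestamp
          if (!info.found_on_disk)
              info.size = 0;                     // key exists only in cache
      }
      return info;
  }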

Lifecheck operation:

Once a second the cache's lifecheck thread syncs data to disk and erases stale data. It repeats the following sequence of actions (a sketch follows the list):

  1. Get global cache lock
  2. Create list of elements that need to be synced or erased
  3. Release global cache lock
  4. Iterate over elements for sync
    1. Get dnet_oplock on element for syncing
    2. Sync element
    3. Release dnet_oplock on element for syncing
  5. Get global cache lock
  6. Iterate over elements for erase
    1. Erase element
  7. Release global cache lock
  8. Sleep for 1 second
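In code the loop could look roughly like this; the lock and helper names are illustrative, with dnet_oplock_guard standing in for the per-key operation lock mentioned above.

  #include <chrono>
  #include <mutex>
  #include <thread>
  #include <vector>

  void lifecheck_thread() {
      for (;;) {
          std::vector<cache_entry *> to_sync, to_erase;
          {
              // steps 1-3: collect candidates under the global lock
              std::lock_guard<std::mutex> guard(global_cache_lock);
              collect_candidates(to_sync, to_erase);
          }
          for (cache_entry *e : to_sync) {
              // step 4: sync outside the global lock, under per-key oplocks
              dnet_oplock_guard oplock(e);
              sync_to_disk(e);
          }
          {
              // steps 5-7: erase under the global lock again
              std::lock_guard<std::mutex> guard(global_cache_lock);
              for (cache_entry *e : to_erase)
                  erase(e);
          }
          std::this_thread::sleep_for(std::chrono::seconds(1)); // step 8
      }
  }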

The scheme is designed this way to avoid contention on the global cache lock: the slow sync step runs outside it, under per-element oplocks.

Configuration

There are several parameters for tweaking cache layer:

  1. cache_size - total size of the cache layer in bytes
  2. caches_number - number of independent caches (16 by default)
  3. cache_pages_proportions - relative sizes of cache pages; for example, with cache_pages_proportions = 1:2:3 the pages will have sizes n, 2n and 3n, where n = cache_size / 6
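For example, a server configuration could contain the following cache settings (a hedged excerpt; treat the exact option syntax of your elliptics version as authoritative):

  cache_size = 68719476736          # 64 GiB for the whole cache layer
  caches_number = 16                # independent caches, one lock each
  cache_pages_proportions = 1:2:3   # pages of sizes n, 2n, 3n, n = cache_size / 6
  cache_sync_timeout = 30           # seconds before dirty entries are synced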

Implementation details

The cache needs to handle the following queries efficiently:

  1. find - check if the cache contains a given key
  2. sync/timeout - synchronize or remove old data records
  3. erase - remove a key from the cache

This problem can easily be solved with a search tree data structure for find/erase requests plus a priority queue (heap) for the nearest sync and timeout events, but that approach carries significant memory overhead. Instead we use a Cartesian tree for both purposes. Each node of the tree is described by its key and priority; the defining property is that the tree is a binary search tree with respect to keys and a heap with respect to priorities. In our case the key is the data record's id and the priority is the minimal event time related to that record. This way the root of the tree always holds the first event to be processed, while keys can still be found or erased quickly. Moreover, this approach requires only 2 pointers per node instead of 6. A comparison of running times showed that the Cartesian tree beats std::set on inserts and erases and is about the same on finds; the measurements covered three implementations: std::set, a Cartesian tree on raw pointers, and a Cartesian tree on memory allocated in a vector.
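A minimal sketch of such a node layout (illustrative, not the actual elliptics source):

  #include <cstdint>

  // Cartesian tree (treap) node: a binary search tree by record id and a
  // min-heap by the time of the record's next event (sync or timeout).
  // Two child pointers are the only per-node overhead.
  struct cache_node {
      uint64_t id;           // BST key: data record's id
      uint64_t next_event;   // heap priority: minimal event time
      cache_node *left = nullptr;
      cache_node *right = nullptr;
  };

  // Ordinary BST search by id.
  cache_node *find(cache_node *n, uint64_t id) {
      while (n && n->id != id)
          n = id < n->id ? n->left : n->right;
      return n;
  }

  // By the heap property the root always holds the earliest event, so the
  // lifecheck thread reads the next due event in O(1).
  const cache_node *next_event(const cache_node *root) {
      return root;
  }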
