Reverbrain wiki


Replication and recovery in Elliptics


Elliptics has used replication to ensure data availability since its first day. One of the main design goals was to create a system capable of dealing with the case when a whole datacenter or geographical region loses connectivity. Since those early days (the first public Elliptics version was released in early 2009), not many players on the distributed storage market have been capable of dealing with this problem. Basically, Elliptics treats Amazon’s yearly whole-region failure as a usual replica-is-not-available problem and automatically switches to available copies in different regions. One of our clusters contains replicas in the US (Nevada), Europe (Amsterdam) and Russia (several replicas in Moscow and the Moscow region, and in the Ryazan region).


To provide this level of availability we introduced the notion of a group. Basically, a group is just a set of servers (it can even be a single node) which are logically bound to each other by the admin. Some of our current installations treat a whole datacenter as a single group, so in the example above we could have 5 groups: US, Europe and 3 in Russia.

But usually things are a bit more complex: there are different power lines, different connectivity options and so on, so generally a group (or replica) is a smaller set of machines in one datacenter. Nothing prevents spreading it over multiple datacenters, of course.

The group a node belongs to is set in its config, in the group = section.

Client configuration

When a client configures itself, it specifies which groups it wants to work with. The Elliptics client library stores this info to automatically switch between groups to find an available replica. Due to eventual consistency, one of them may be unavailable or out of sync. For this case each record has a timestamp, which is used by the read-latest set of APIs: the library reads timestamps from all available replicas, selects the one with the latest update time and reads data from that group.
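The read-latest selection can be illustrated with a minimal sketch. This is not the Elliptics client API: the ReplicaInfo type and the pick_latest_group function are hypothetical names, and a timestamp of None stands for a replica that is unavailable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicaInfo:
    """Hypothetical lookup result for one group (not an Elliptics type)."""
    group: int
    timestamp: Optional[float]  # None means the replica did not answer


def pick_latest_group(replicas: List[ReplicaInfo]) -> Optional[int]:
    """Return the group holding the most recently updated copy, or None."""
    available = [r for r in replicas if r.timestamp is not None]
    if not available:
        return None
    # The group with the largest update time is the one to read data from.
    return max(available, key=lambda r: r.timestamp).group
```

With replicas in groups 1, 2 and 3, where group 2 is unavailable and group 3 has the newest timestamp, the function returns 3.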


There are 3 types of recovery process in Elliptics: hash ring recovery aka merge, deep hash ring recovery aka deep_merge and replica recovery aka dc.

Hash ring recovery

The merge process basically moves data within the same group from one node to another. It is needed when one or several new nodes are added (or returned back). In this case the route table changes, and part of the ID ranges covered by a given node start belonging to the new node. The system should move data from old nodes to the new one, since the new node starts serving requests immediately after its start, without waiting for data to be copied to its storage. Merge is run with the dnet_recovery script described below; its help message should explain the usage.

Replica recovery

The second recovery process is needed when one or more replicas lose data due to disk damage or other problems. To date this process is rather time consuming (hundreds of millions of records are checked and copied in roughly a week). We have to check every single key on every system to ensure that it is in sync. Moreover, when we check a single server, its replicas are not considered checked (although they can be updated), so to ensure that the whole cluster is in sync, one must check every node.

It is normal for part of a distributed system to be unavailable; after all, we keep several replicas exactly for that. So whole-replica recovery does not need to be started too frequently. One has to determine the frequency of data loss in a single replica and run recovery accordingly. In our systems we start it several times per month, sometimes even less frequently if we know that things were OK.

Read repair

Eventually consistent systems are very scalable: their performance is orders of magnitude higher than that of synchronous systems, and their design allows true horizontal scalability without complexities and bugs.

But it comes at a price: data recovery is postponed in time, sometimes for quite a long period. All distributed systems provide some kind of data replication, so this is usually not a problem. But having a complete set of fully recovered replicas is not only safe, it also provides better performance, since clients may read data in parallel from different replicas (in systems like Elliptics which provide this functionality).

To close the gap between data recovery and current data requests, we implemented on-demand recovery in Elliptics. This is simply a data writeback into the storage when we have read an object and detected that it is missing on one or more replicas.

We preserve the timestamps of the records. This method does not remove the need to perform regular data checks (with data recovery if needed); instead, it speeds up recovery for the most actively used objects.
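A minimal sketch of the read-repair idea, assuming a toy in-memory model where each group is a dict mapping a key to a (timestamp, data) pair. The read_with_repair name and the data layout are hypothetical and do not reflect the actual Elliptics implementation.

```python
def read_with_repair(replicas, key):
    """Read key and write it back to every group that is missing it.

    replicas: dict of group id -> dict of key -> (timestamp, data).
    This in-memory layout is an assumption made for illustration only.
    """
    found = {g: store[key] for g, store in replicas.items() if key in store}
    if not found:
        return None
    # Use the copy with the latest timestamp as the source of truth.
    timestamp, data = max(found.values(), key=lambda record: record[0])
    for store in replicas.values():
        if key not in store:
            # Writeback preserves the original record timestamp.
            store[key] = (timestamp, data)
    return data
```

After a read that finds the key in groups 1 and 3 but not in group 2, group 2 holds the same data with the same timestamp.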

Manual recovery

Starting with Elliptics 2.24, a new recovery script called dnet_recovery was introduced. It provides a fast and robust way of recovery for Elliptics networks. It's written in Python and utilizes new metadata, iterators and the new async APIs provided by the C++ binding.

It can work in both hash ring recovery and replica recovery modes. Usually recovery consists of a number of parallel steps: determining which key ranges need to be recovered and from which nodes, running iterators on those nodes for the particular ranges, computing the difference between local and remote iterator results, and finally recovering the resulting difference. This process varies somewhat between dc and merge modes, but this describes it well enough for a high-level overview.

dnet_recovery script

Elliptics is now distributed with the dnet_recovery script, which provides a very simple, informative and robust interface to the recovery process.

Recovery options

The node that needs to be recovered is specified via the -r option. It has to be in the Elliptics format address:port:family. On Linux the IPv4 family is specified with 2 and IPv6 with 10. This is a mandatory option which should always be specified. Most likely dnet_recovery will be run via cron with -r $(hostname -f):1025:2
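The family numbers above are the values of the AF_INET and AF_INET6 socket constants on Linux. As a small sketch, the address string could be built like this; the remote_address helper is a hypothetical name and not part of dnet_recovery.

```python
import socket


def remote_address(host, port, family=socket.AF_INET):
    """Format a node address as address:port:family (hypothetical helper).

    int() is needed because socket.AF_INET is an enum member in Python 3.
    AF_INET6 is 10 on Linux but may differ on other platforms.
    """
    return "{}:{}:{}".format(host, port, int(family))
```

For example, remote_address(socket.getfqdn(), 1025) produces the same kind of string as $(hostname -f):1025:2 in the cron example.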

Groups that should be recovered are specified as a comma-separated list with the -g option. If no groups are specified, all groups are used.

Recovery speed is proportional to the number of parallel workers (controlled via the -n option) and the batch size (controlled via the -b option). The former should be set to the number of disks in the RAID for IO-bound workloads and to the number of CPUs for CPU-bound workloads (of course, if the system needs to be able to serve requests during recovery, this option should be bounded even more). The latter controls the size of the bulk-read/bulk-write batch. While bigger values speed up recovery, they also consume more memory (memory consumption can be computed as avg_record_size * batch size * 2).
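The memory formula can be turned into a quick estimate when choosing -b. The sketch below only restates the formula from the text; the function name is hypothetical, and whether the figure applies per worker or per script run is not stated here, so treat the result as a rough guide.

```python
def estimate_batch_memory(avg_record_size, batch_size):
    """Rough memory estimate in bytes: avg_record_size * batch size * 2."""
    return avg_record_size * batch_size * 2
```

For example, with an average record size of 4096 bytes and a batch size of 1024, the estimate is 8388608 bytes (8 MiB).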

To speed up the recovery process one should sort data in eblobs by key via dnet_ioclient -r $(hostname -f):1025:2 -d start, which should minimize disk seeks during iteration.

One can also speed up recovery by specifying a time range via the -t option, so that recovery only takes into account keys that were modified since the given timestamp. It can be given either as an absolute epoch value (e.g. 1368940603) or relative to now() in hours, days or weeks (e.g. 12h, 1d, or 4w). The latter is useful for cron jobs: for example, one can set up a weekly cron job that recovers only keys modified within the last month. This will be reasonably fast and most likely equivalent to a full recovery.
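To make the two accepted forms of the -t value concrete, here is a minimal parsing sketch. It is an illustration of the format described above, not the parser used by dnet_recovery; the parse_time_option name and the now parameter are hypothetical.

```python
import time

# Seconds per unit for the relative forms (hours, days, weeks).
_UNIT_SECONDS = {"h": 3600, "d": 86400, "w": 604800}


def parse_time_option(value, now=None):
    """Convert '1368940603', '12h', '1d' or '4w' to an absolute epoch."""
    now = time.time() if now is None else now
    value = value.strip()
    if value and value[-1] in _UNIT_SECONDS:
        # Relative form: subtract the interval from the current time.
        return int(now) - int(value[:-1]) * _UNIT_SECONDS[value[-1]]
    # Absolute form: the value is already an epoch timestamp.
    return int(value)
```

Keys with a modification time older than the returned epoch would be skipped by the recovery run.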

The only merge-specific option is -o, which limits merge recovery to the node provided by -r/--remotes. It means that only the node specified by -r will be scanned for keys which shouldn't be on it, and those keys will then be moved to the proper nodes within the group.

Admin options

By default the script uses the /var/tmp/dnet_recovery_%MODE% temporary directory to store iteration results from nodes (can be changed via the -D option). The recovery log goes to tmp_dir/dnet_recovery.log (can be changed via the -l option).

Script verbosity is controlled via the -d option: if it is given, the script switches to debug output, dumping much more information to the console. This option does not affect log file verbosity, which is always set to DEBUG. One can also set Elliptics library verbosity via the -L option.

Recovery stats are also saved to tmp_dir/stat.txt. Stats are accessible over HTTP if the -m option is given.

There is also the -N option, which turns on the so-called dry-run mode that does not actually modify any data. This can be used by admins to test the recovery procedure.

For use in cron jobs, the -S/--safe option should be used in merge mode to skip deletion of recovered keys.

By default only one instance of any recovery type can run at a time. The file specified via the -k option is locked for the whole duration of recovery.
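Single-instance protection through a lock file is commonly done with an advisory lock. The sketch below shows one way to do it on Linux with fcntl.flock; it is an assumption that dnet_recovery works along these lines, and the acquire_lock name is hypothetical.

```python
import fcntl


def acquire_lock(path):
    """Try to take an exclusive lock on path without blocking.

    Returns the open file object (keep it open for the whole run),
    or None if another instance already holds the lock.
    """
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    return lock_file
```

The lock is released when the file is closed or the process exits, so a crashed run does not leave a stale lock behind.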


Recovery stats have several sections; each section except the main one describes a stage of recovery. Every counter in the stats has 3 values:

  • some_counter_success - count of succeeded operations, or number of keys which were successfully read or written
  • some_counter_failed - count of failed operations, or number of keys which could not be read or written
  • some_counter_total = some_counter_success + some_counter_failed
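The relation between the three values can be sketched with a small counter class. This is an illustration of the naming scheme only; the StatCounter class is hypothetical and is not the stats code of dnet_recovery.

```python
class StatCounter:
    """Hypothetical counter that tracks success and failure counts."""

    def __init__(self, name):
        self.name = name
        self.success = 0
        self.failed = 0

    def add(self, ok, count=1):
        """Record count operations (or keys) as succeeded or failed."""
        if ok:
            self.success += count
        else:
            self.failed += count

    @property
    def total(self):
        # _total is always derived: success + failed.
        return self.success + self.failed

    def as_dict(self):
        """Export the three values under the names used in the stats."""
        return {
            self.name + "_success": self.success,
            self.name + "_failed": self.failed,
            self.name + "_total": self.total,
        }
```

A counter named recovered_keys with 5 successes and 2 failures exports recovered_keys_success = 5, recovered_keys_failed = 2 and recovered_keys_total = 7.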

The main section has the title “monitor” and contains information about the recovery as a whole:

  • main_started - when it was started
  • main_finished - when it was finished
  • iterations - number of iterate operations
  • main stats for dc recovery only
    • main_started…merge_and_split - time spent on iterating, sorting and finding diffs
    • merged_diffs - total count of keys after the merge_and_split operation. For _total, the number of keys from all remote nodes which will be recovered
    • merged_diffs_node - number of keys which will be recovered from this node.

The iterate section has the title “iterate_local” for the local node and “iterate_ip” for the remote node with address ip. This section contains information about node iteration and sorting of the iterator result:

  • iterations - number of iterators which were completed. iterations_success should be 1 if node iteration succeeded, and iterations_failed should be 1 if node iteration failed
  • iterated_keys - number of keys which were iterated on the node.
  • sort - number of sort operations which were completed. sort_success = 1 means the sort succeeded, sort_failed = 1 means the sort failed.
  • process_started - when node iteration was started
  • process_started…iterate - initialization time before iterating
  • process_iterate…sort - time spent on node iteration
  • process_sort…finished - time spent on sorting node iteration results
  • process_finished - when iteration and sort were finished

The diff section has the title “diff_remote_ip”. This section contains information about finding differences between the local key set and the remote key set:

  • diff - number of keys which are absent on the local node and present on the remote node, or whose timestamp on the remote node is newer than on the local node
  • process_start - when diff was started
  • process_start…diff - time spent on initialization
  • process_diff…finished - time spent on the diff
  • process_finished - when diff was finished
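The rule behind the diff counter can be sketched in a few lines. This is a simplified in-memory illustration that assumes iterator results are (key, timestamp) pairs; the real script works on sorted iterator result files, and the compute_diff name is hypothetical.

```python
def compute_diff(local, remote):
    """Return the remote (key, timestamp) pairs that need recovery.

    A key is included if it is absent on the local node, or if the
    remote timestamp is newer than the local one.
    """
    local_ts = dict(local)
    return sorted(
        (key, ts)
        for key, ts in remote
        if key not in local_ts or ts > local_ts[key]
    )
```

For a local node holding a (timestamp 10) and b (timestamp 20), and a remote node holding a (10), b (25) and c (5), the diff contains b and c.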

The recovering section has the title “recover_ip”. This section contains information about recovering keys by reading them from the ip node and writing the read data to the local node:

  • read_keys - number of keys which were read from the remote node
  • recovered_bytes - number of bytes of data which were read from the remote node and then written to the local node
  • recovered_keys - number of keys which were read from the remote node and then written to the local node
  • recover_started - when recovering keys from the remote node was started
  • recover_started…finished - time spent on recovering keys from the remote node
  • recover_finished - when recovering keys from the remote node was finished

All sections are filled in as recovery progresses through its stages: first iteration and sort, then diff, and then recovery itself. If some stage fails for some node, this node does not participate in the remaining stages. For example, if all iterations fail, there is nothing to sort, diff or recover, so recovery exits with code 1.

In-depth merge recovery description

Merge is the recovery mode that tries to maintain sane key placement within one hash ring (usually one failure zone, e.g. a datacenter). This is done by downloading the routing table and computing which node is responsible for which range. Then, on all nodes from the specified groups, we need to determine the ranges which do not belong to the node. On each node the script iterates over keys and collects the keys which shouldn't be on the node. Each found key is moved from the node to the proper node only if the proper node does not already have newer data for the key. If the proper node already has newer data, the key is just removed from the iterated node.

Removal of keys from the source node can be skipped if the -S option is specified.
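The merge logic can be sketched on a toy hash ring. The sketch assumes that the route table is a sorted list of (range start, node) pairs and that a node owns IDs from its start up to the next start, wrapping around at the end of the ring; integer IDs, the function names and the treatment of equal timestamps are all assumptions made for illustration.

```python
import bisect


def owner(route_table, key_id):
    """Return the node responsible for key_id.

    route_table: sorted list of (range_start, node) pairs (assumed layout).
    """
    starts = [start for start, _ in route_table]
    idx = bisect.bisect_right(starts, key_id) - 1
    # idx == -1 means the key is below the first start: wrap to the last node.
    return route_table[idx][1]


def misplaced_keys(route_table, node, keys):
    """List (key, proper_node) for keys stored on node but owned elsewhere."""
    return [(k, owner(route_table, k)) for k in keys
            if owner(route_table, k) != node]


def merge_action(src_ts, dst_ts, safe=False):
    """Decide (copy, remove) for one misplaced key.

    Copy unless the proper node already has newer data; remove from the
    source unless safe mode (-S) is on. Equal timestamps are treated as
    not newer, which is an assumption of this sketch.
    """
    copy = dst_ts is None or dst_ts <= src_ts
    return copy, not safe
```

With a route table of [(100, "A"), (200, "B"), (300, "C")], key 250 found on node A would be reported as belonging to B, and key 50 would wrap around to C.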

In-depth dc recovery description

DC recovery mode is intended for recovering outdated or missing keys on an Elliptics node by using data on Elliptics nodes from other groups. This recovery mode is useful when a whole group was offline, or for synchronizing data between different groups. The script downloads the route table, computes the ranges of the recovering node, and determines the nodes from other groups whose ranges intersect with them. Then it runs iterators on the recovering node and on all found nodes. On the recovering node the iterator goes over all of the node's ranges. On nodes from other groups the iterator goes over the intersection of that node's ranges and the recovering node's ranges. All results are sorted, and the differences between them and the recovering node's iterator result are computed. Then duplicated keys are removed from the results, keeping the key with the latest timestamp. At the end all picked keys are read from their nodes and written to the recovering node. All operations with nodes (iteration and reading/writing) are executed in a process pool whose size is specified via the -n option.
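The deduplication step, where the same key may appear in the diffs of several remote nodes, can be sketched as follows. The sketch assumes that each per-node diff is a list of (key, timestamp) pairs held in memory; the merge_diffs name and the data layout are hypothetical.

```python
def merge_diffs(diffs):
    """Pick, for every key, the remote node with the latest timestamp.

    diffs: dict of node -> list of (key, timestamp) pairs to recover.
    Returns a dict of key -> (node, timestamp).
    """
    best = {}
    for node, entries in diffs.items():
        for key, ts in entries:
            # Keep only the newest copy of a key seen on several nodes.
            if key not in best or ts > best[key][1]:
                best[key] = (node, ts)
    return best
```

The result tells the final stage which node each key should be read from before it is written to the recovering node.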

Estimating stage execution time

  • Iteration/sort/diff stage:
    • there is no way to estimate the time remaining in this stage
  • Merging diffs:
    • estimated time to the end of the stage: (size of diff/iterator files / size of merge files - 1) * time elapsed since the start of the stage. All files are located in the temporary directory specified via the -D option.
  • Copying/recovering stage:
    • estimated time to the end of the stage: (merged_diffs / recovered_keys_success - 1) * time elapsed since the start of the stage. merged_diffs and recovered_keys_success can be found in the monitor data.
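Both estimates above have the same shape: (total / done - 1) * elapsed time. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the inputs would be file sizes for the merging stage or merged_diffs and recovered_keys_success for the copying stage.

```python
def estimate_remaining(total, done, elapsed):
    """Estimate the time left in a stage as (total / done - 1) * elapsed.

    Returns None while nothing has been processed yet, since the ratio
    is undefined in that case.
    """
    if done <= 0:
        return None
    return (total / done - 1) * elapsed
```

For example, if 250 of 1000 keys were recovered in the first 60 seconds of the stage, the estimate is (1000 / 250 - 1) * 60 = 180 seconds remaining.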

Known Issues

  • Buffer bloat between C++ and Python. Iterator results that have already been read from the network are buffered on the client side of the Elliptics library while waiting to be read by Python. We will fix this by introducing iterator flow control on the client side.
  • The recovery process can stall if dnet_ioserv is killed by the OOM killer in the middle of recovery. This will also be fixed soon by waking up condition variable waiters on socket error.
elliptics/replication.txt · Last modified: 2015/05/21 04:07 by zbr