Elliptics has used replication to ensure data availability since its first day. One of the main design goals was to build a system capable of surviving the loss of a whole datacenter or geographical region. Since those early days (the first public Elliptics version was released in early 2009) there have still been few players on the distributed storage market able to handle this problem. Elliptics treats Amazon's yearly whole-region failure as an ordinary replica-is-not-available event and automatically switches to available copies in other regions. One of our clusters contains replicas in the US (Nevada), Europe (Amsterdam) and Russia (several replicas in and around Moscow and in the Ryazan region).
To provide this level of availability we introduced the notion of a group. A group is simply a set of servers (possibly even a single node) that are logically bound together by the administrator. Some of our installations treat a whole datacenter as a single group, so in the example above we would have 5 groups: US, Europe and three in Russia.
But things are usually a bit more complex: there are different power lines, different connectivity options and so on, so a group (or replica) is typically a smaller set of machines within one datacenter. Nothing prevents spreading it over multiple datacenters, of course.
The group a node belongs to is written in the config in the
group = section.
When a client configures itself, it specifies which groups it wants to work with.
The Elliptics client library stores this information and automatically switches between groups to find an available replica. Because of eventual consistency, one of them may be unavailable or out of sync; to handle this, each record carries a timestamp, which is used by the
read-latest set of APIs – the library reads timestamps from all available replicas, selects the one with the latest update time and reads data from that group.
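The read-latest selection can be sketched as follows. This is a minimal illustration, not the real client API: `Replica` and `pick_latest_group` are hypothetical names standing in for the library's internal bookkeeping.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the elliptics client's view of one replica group:
# each group reports the timestamp of its copy of a record, if reachable.
@dataclass
class Replica:
    group_id: int
    timestamp: int   # update time of this group's copy, seconds since epoch
    available: bool

def pick_latest_group(replicas):
    """Mimic read-latest: query timestamps from all reachable replicas
    and choose the group holding the most recent copy."""
    reachable = [r for r in replicas if r.available]
    if not reachable:
        raise RuntimeError("no replica available")
    return max(reachable, key=lambda r: r.timestamp).group_id

replicas = [
    Replica(group_id=1, timestamp=1368940603, available=True),
    Replica(group_id=2, timestamp=1368950000, available=True),
    Replica(group_id=3, timestamp=0, available=False),  # datacenter down
]
print(pick_latest_group(replicas))  # -> 2, the group with the newest copy
```

The unavailable group is simply skipped, which is exactly the whole-region-failure behavior described above.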
There are three types of recovery process in Elliptics: hash ring recovery aka
merge, deep hash ring recovery aka
deep_merge, and replica recovery aka dc.
The merge process is basically data movement within a single group from one node to another, which happens when one or several new nodes are added (or brought back). In this case the route table changes, and part of the ID ranges covered by a given node starts belonging to the new node. The system must move data from the old nodes to the new one, since the new node starts serving requests immediately after start-up, without waiting for data to be copied to its storage.
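The route-table change can be illustrated with a toy hash ring. This is a simplified sketch, not Elliptics' actual ring implementation: the `Ring` class and `keys_to_merge` helper are invented for illustration, and the hash is a truncated SHA-512.

```python
import hashlib
from bisect import bisect_left

def key_hash(key: bytes) -> int:
    # Toy 64-bit key hash; Elliptics uses its own ID space.
    return int.from_bytes(hashlib.sha512(key).digest()[:8], "big")

class Ring:
    """Toy hash ring: each node owns the ID range up to its position."""
    def __init__(self, nodes):
        # nodes: {name: position on the ring}
        self.points = sorted((pos, name) for name, pos in nodes.items())

    def owner(self, key: bytes) -> str:
        h = key_hash(key)
        idx = bisect_left(self.points, (h, "")) % len(self.points)
        return self.points[idx][1]

def keys_to_merge(keys, old_ring, new_ring):
    """Keys whose owner changed after a node joined: merge must move these."""
    return [k for k in keys if old_ring.owner(k) != new_ring.owner(k)]

old = Ring({"node1": 2**63, "node2": 2**64 - 1})
new = Ring({"node1": 2**63, "node2": 2**64 - 1, "node3": 2**62})  # node3 added
keys = [b"key-%d" % i for i in range(1000)]
moved = keys_to_merge(keys, old, new)
# every moved key now belongs to the newly added node3
assert all(new.owner(k) == "node3" for k in moved)
```

Adding node3 shrinks the range node1 was responsible for, and merge moves exactly the keys that fell into the new node's range.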
The second recovery process is needed when one or more replicas lose their data due to disk damage or other problems. To date this process is rather time consuming (hundreds of millions of records are checked and copied over roughly a week). We have to check every single key on every system to ensure it is in sync. Moreover, when we check a single server, its replicas are not considered checked (although they may be updated along the way), so to ensure that the whole cluster is in sync, one must check every node.
It is normal in a distributed system for part of it to be unavailable – after all, we keep several replicas exactly for that. So a whole-replica recovery need not be started too frequently. One has to estimate how often data is lost in a single replica and run recovery accordingly. In our systems we start it several times per month, sometimes even less frequently if we know things were OK.
Eventually consistent systems are very scalable – their performance is orders of magnitude higher than that of synchronous systems, and their design allows true horizontal scalability without extra complexity and bugs.
But this comes at a price – data recovery is postponed, sometimes for quite a long period. All distributed systems provide some kind of data replication, so this is usually not a problem. But having a complete set of fully recovered replicas is not only safer, it also gives better performance, since clients may read data in parallel from different replicas (in systems like Elliptics that provide this functionality).
To close the gap between regular data recovery and current data requests, we implemented on-demand recovery in Elliptics. It is simply a writeback into the storage: when we read an object and detect that one or more replicas are missing it, we write it back, preserving the record timestamps. This does not remove the need for regular data checks (with recovery if needed); instead it speeds up recovery for the most actively used objects.
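The on-demand writeback path can be sketched like this. This is an illustration of the idea only; `read` and `write` are hypothetical callbacks, not the real client API.

```python
def read_with_ondemand_recovery(key, groups, read, write):
    """Sketch of on-demand recovery: read the object, and if some replicas
    are missing it, write it back to them preserving the original record
    timestamp, so later regular recovery sees consistent metadata."""
    copies = {}
    missing = []
    for g in groups:
        rec = read(g, key)          # hypothetical: (data, timestamp) or None
        if rec is None:
            missing.append(g)
        else:
            copies[g] = rec
    if not copies:
        raise KeyError(key)
    # serve the newest copy, read-latest style
    data, ts = max(copies.values(), key=lambda r: r[1])
    for g in missing:
        write(g, key, data, ts)     # writeback keeps the record timestamp
    return data

# toy dict-backed "storage" with group 2 missing the object
store = {1: {"obj": (b"data", 100)}, 2: {}, 3: {"obj": (b"data", 100)}}
read = lambda g, k: store[g].get(k)
write = lambda g, k, d, ts: store[g].__setitem__(k, (d, ts))
data = read_with_ondemand_recovery("obj", [1, 2, 3], read, write)
assert store[2]["obj"] == (b"data", 100)  # replica repaired on read
```

The key point is that the writeback reuses the existing timestamp rather than stamping a new one, so the repaired replica is indistinguishable from one that was never lost.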
Starting with Elliptics
2.24 a new recovery script called
dnet_recovery was introduced. It provides a fast and robust way of recovering Elliptics
networks. It is written in
Python and utilizes the new metadata, iterators and
new async APIs provided by the C++ binding.
It can work in both hash ring recovery and replica recovery modes. Recovery
usually consists of a number of parallel steps: determining which key
ranges need to be recovered and from which nodes, running iterators on those
nodes for the particular ranges, computing the difference between local and remote
iterator results, and finally recovering the resulting difference. This process
varies somewhat between the
merge modes, but the description is accurate
enough for a high-level overview.
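The diff step at the heart of this pipeline can be sketched as follows. This is a simplification, assuming iterator results are reduced to key-to-timestamp maps; the real script works with richer iterator records.

```python
def diff(local, remote):
    """Compare iterator results: {key: timestamp} maps from the local node
    and a remote node. Return the keys the local node must fetch, i.e.
    keys that are missing locally or whose local copy is stale."""
    to_recover = []
    for key, remote_ts in remote.items():
        local_ts = local.get(key)
        if local_ts is None or local_ts < remote_ts:
            to_recover.append(key)
    return to_recover

local = {"k1": 100, "k2": 200}
remote = {"k1": 150, "k2": 200, "k3": 50}
print(diff(local, remote))  # ['k1', 'k3']: k1 is stale, k3 is missing
```

Everything not in the returned list is already in sync and is skipped, which is why the iterator/diff approach is much cheaper than blindly copying whole replicas.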
Elliptics is now distributed with the
dnet_recovery script, which provides a very
simple, informative and robust interface to the recovery process.
The node that needs to be recovered is specified via the
-r option, in the form
address:port:family. On Linux the IPv4 family is specified as 2
and IPv6 as 10. This is a mandatory option which must always be specified.
For example, dnet_recovery can be run via cron with
-r $(hostname -f):1025:2
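The address:port:family triplet maps directly onto the standard socket address families. A small illustrative parser (the function name and hostname are made up; `rsplit` keeps a colon-containing IPv6 literal intact):

```python
import socket

def parse_remote(spec: str):
    """Parse an 'address:port:family' string as accepted by -r.
    On Linux, family 2 is AF_INET (IPv4) and 10 is AF_INET6 (IPv6)."""
    addr, port, family = spec.rsplit(":", 2)
    fam = {2: socket.AF_INET, 10: socket.AF_INET6}[int(family)]
    return addr, int(port), fam

# hypothetical host name, port 1025 as in the cron example above
print(parse_remote("storage-1.example.com:1025:2"))
```

The numeric family values 2 and 10 are simply the Linux values of the AF_INET and AF_INET6 constants.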
Groups that should be recovered are specified as a comma-separated list.
If no groups are specified, all groups are used.
Recovery speed is proportional to the number of parallel workers (controlled via the
-n option) and the batch size (controlled via the
-b option). The former should be
set to the number of disks in the RAID for IO-bound workloads and to the number of CPUs for
CPU-bound workloads (of course, if the system needs to serve requests during
recovery, this option should be bounded even more). The latter controls the
size of the bulk-read/bulk-write batch. While bigger values speed up recovery, they
also consume more memory (memory consumption can be estimated as
avg_record_size * batch_size * 2).
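As a quick worked example of that estimate (the helper function is ours, just wrapping the formula above):

```python
def batch_memory_bytes(avg_record_size, batch_size):
    """Memory estimate from the formula above: data is held in both the
    bulk-read and the bulk-write buffer, hence the factor of 2."""
    return avg_record_size * batch_size * 2

# e.g. 64 KiB average records with a batch of 1024 records:
print(batch_memory_bytes(64 * 1024, 1024) // 2**20, "MiB")  # 128 MiB
```

So with multiple parallel workers the per-worker batches add up, which is another reason to keep -n bounded on loaded systems.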
To speed up the recovery process one should sort data in eblobs by key via
dnet_ioclient -r $(hostname -f):1025:2 -d start, which minimizes disk seeks during iteration.
One can also speed up recovery by specifying a time range, so that recovery
only takes into account keys that were modified after the given
timestamp. It can be given either as an absolute value (e.g.
1368940603) or relative to
now() in hours, days or weeks (e.g.
4w). The latter is useful for cron jobs – for example, one
can set up a cron job, running each week, that recovers only keys modified within the last month;
this has pretty decent speed but is most likely equivalent to a full recovery.
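The two accepted forms can be sketched with a small parser. This is illustrative only: the actual option syntax of dnet_recovery may differ, and `parse_since` is an invented name.

```python
import time

# relative-suffix units described above: hours, days, weeks
UNITS = {"h": 3600, "d": 86400, "w": 7 * 86400}

def parse_since(value, now=None):
    """Accept an absolute UNIX timestamp, or an offset back from now()
    such as '12h', '3d' or '4w'."""
    now = time.time() if now is None else now
    if value[-1] in UNITS:
        return now - int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # absolute timestamp, e.g. 1368940603

print(parse_since("4w", now=1368940603))  # 1366521403, four weeks earlier
```

Pinning `now` makes the example deterministic; a real cron job would just use the current time.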
The only merge-wide option is
-o, which limits merge recovery to the node given by
--remotes: only the node specified by
-r will be scanned for keys that shouldn't be on it, and those keys will then be moved to the proper nodes within the group.
By default the script uses the
/var/tmp/dnet_recovery_%MODE% temporary directory to
store iteration results from the nodes (this can be changed via an option).
The recovery log goes to tmp_dir/dnet_recovery.log (this can also be changed via an option).
Script verbosity is controlled via the
-d option: if it is given, the script
switches to debug output, dumping much more information to the console. This option
does not affect the log file verbosity, which is always set to
DEBUG. The Elliptics library verbosity can also be set via a separate option.
Recovery stats are also saved to
tmp_dir/stat.txt, and are accessible
via HTTP if the
-m option is given.
There is also a
-N option which turns on a so-called dry-run mode that does not
actually modify any data. Admins can use it to test the recovery procedure.
For use in cron jobs, the
--safe option should be used in
merge mode to
skip deletion of recovered keys.
By default only one instance of any recovery type can run at a time: for the
whole duration of the recovery, the file specified via the
-k option is locked.
The recovery stats have several sections, each describing one stage of the recovery, plus the main section. All counters in the stats have three values:
The main section is titled “monitor” and contains information about the whole recovery:
The iterate section is titled “iterate_local” for the local node and “iterate_ip” for the remote node at that ip. It contains information about iterating over a node and sorting the iterator results:
The diff section is titled “diff_remote_ip”. It contains information about finding the differences between the local and remote key sets:
The recovering section is titled “recover_ip”. It contains information about recovering keys by reading them from the node at ip and writing the data to the local node:
All sections are filled in as recovery progresses through the stages: first iteration and sorting, then diff, then the recovery itself. If some stage fails for a node, that node does not participate in the remaining stages. For example, if all iterations fail, there is nothing to sort, diff or recover, so recovery exits with code 1.
Merge is the recovery mode that maintains sanity of key placement within one hash ring (usually one failure zone, e.g. a datacenter). It downloads the routing table and computes which node is responsible for which range. Then, on every node in the specified groups, it determines the ranges that do not belong to that node, iterates over the keys and collects those which shouldn't be there. Each found key is moved to the proper node, unless the proper node already has newer data for it; in that case the key is simply removed from the iterated node.
Removal of keys from the source node can be skipped if the
-S option is specified.
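The per-key merge decision just described can be summarized in a few lines. This is an illustrative sketch: `merge_key`, `move` and `remove` are invented names, and timestamps stand in for the real metadata comparison.

```python
def merge_key(key, local_ts, proper_ts, move, remove, safe=False):
    """Decision for one misplaced key found by merge: copy it to the proper
    node unless that node already holds newer data, then drop the local
    copy unless running in safe mode (-S / --safe)."""
    if proper_ts is None or proper_ts < local_ts:
        move(key)        # proper node lacks the key or has an older version
    if not safe:
        remove(key)      # skipped when removal is disabled

moved, removed = [], []
merge_key("k1", 100, None, moved.append, removed.append)            # missing: move + remove
merge_key("k2", 100, 200, moved.append, removed.append)             # newer elsewhere: just remove
merge_key("k3", 100, 50, moved.append, removed.append, safe=True)   # safe: move, keep local copy
print(moved, removed)  # ['k1', 'k3'] ['k1', 'k2']
```

Safe mode trades disk space for safety: misplaced copies linger on the source node, but a botched move can never lose the only copy.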
The DC recovery mode is intended for recovering outdated or missing keys on an Elliptics node using data from nodes in other groups. It is useful when a whole group was offline, or for synchronizing data across groups. The script downloads the routing table and computes the ranges of the node being recovered, then determines the nodes from other groups whose ranges intersect them. It runs iterators on the recovered node (over all of its ranges) and on all the found nodes (over the intersections of their ranges with the recovered node's ranges). All results are sorted and diffed against the recovered node's iterator result; duplicate keys are then removed from the results, keeping the copy with the latest timestamp. Finally, all the picked keys are read from their nodes and written to the node being recovered. All operations with nodes (iteration, reading and writing) are executed in a process pool whose size is set via the -n option.
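The deduplication step, where one winning copy per key is picked across groups, can be sketched as follows (the function name and the flat `(group, key, timestamp)` representation are ours, chosen for brevity):

```python
def pick_latest(candidates):
    """Deduplicate iterator results from several groups: for every key keep
    only the (group, timestamp) entry with the newest timestamp. That is
    the copy dc recovery will read and write to the recovered node."""
    latest = {}
    for group, key, ts in candidates:
        if key not in latest or ts > latest[key][1]:
            latest[key] = (group, ts)
    return latest

# k1 exists in groups 2 and 3 with different timestamps; group 3 wins
results = [(2, "k1", 100), (3, "k1", 150), (3, "k2", 90), (2, "k2", 90)]
print(pick_latest(results))  # {'k1': (3, 150), 'k2': (3, 90)}
```

On a tie the first-seen copy is kept, since either is equally fresh; what matters is that a stale copy never overwrites a newer one on the recovered node.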
There is a known issue where dnet_ioserv is killed by the OOM killer in the middle of recovery. This will also be fixed soon by waking up condition-variable waiters on socket error.