Elliptics project was started in 2009 as a new backend to pohmelfs (version 1 those days). There were no open and mature enough NoSQL systems to date, but Amazon Dynamo was on the rise. Initial Elliptics storage system was rather academical – we did not consider ‘infinitely-growing’ storage, experimented with different routing protocols, played with various replication scenarios. There was no clear vision about how such a system should look.
Thousands of experiments later we ended up with what we have now in large set of production clusters from couple of servers to hundreds of nodes.
Elliptics was specially designed for the case of physically distributed data replicas. Even now there are no simple enough systems which can provide the same level of automatizations during datacenters replications or replica separation (hello yearly Amazon outages). One can manually create such setups in other distributed systems, but it is hard task to find out those which allow parallel write and reading balancing. Usually this is a mix of master-slave design, which may provide stronger consistency by the price or availability.
To date, Elliptics is not only a storage system. There are several layers where storage is located at the lowest one. Elliptics supports 3 low-level backends to date: filesystem (where written objects are stored as files), Eblob – a fast append-only (with rewrite support) storage and Smack – very fast backend designed for small compressible (6 different compressions are supported) objects, stored in sorted tables.
It is a very simple task to write your own backend which may store data in SQL database for example. In this case you can even work with plain SQL commands on client side, while providing data distribution across the whole cluster, but beware that complex joins may be non-consistent with parallel updates.
Elliptics uses eventual consistency model to maintain data replicas. This means that number of copies you write may not always be the same. Client receives write status for every replica system tried to save, but for example, if you configured to write 3 replicas, but only 2 of them were successfully written, third one will be synced with others sometime in future. During this period of non-consistency read may return old data or do not return anything at all (this is not a problem, since Elliptics client will automatically try another replica in this case).
General rule of thumb for eventual consistency systems is to never overwrite old data, but always write data using new keys. In this case there is virtually no non-consistency, there may only be non-complete replica, and client code automatically (well, automatically in Elliptics, in others system you may need to do it manually) switch to another replica and read data from those servers.
Eventually data will be recovered in missed replica, but until it is ready, system generally can not survive loss of another replica. So, recovery process should be frequent enough. In our production we run replica-recovery process once per several days.
In Elliptics, replica is called a group. Group is a set of nodes (or just one) which admin logically bound together. For example one group may contain all servers in single datacenter, or one group may only contain several disks in a single server. When group contains multiple nodes, they form DHT – there is ID ring from 0 to 2512 (by default), and each nodes grabs multiple ranges in this ring. In particular, when Elliptics node starts the very first time, it searches for ‘ids’ file in $history directory (it is specified in config), which contains range boundaries (this is just a set of IDs written one after another, by default each ID is 64-byte long, so each 64-bytes block in ‘ids’ file is a new boundary). If there is no such file, Elliptics node generates own – it creates new random boundary for every 100 Gb of free space it has on the data partition.
Since each node may have several ranges in DHT ring in particular group, recovery process will fetch data from multiple nodes in parallel.
There is another issue with DHT rings – when new node is added or removed, ID ranges change and some nodes may get new ranges or lost. For example when new node starts it connects to others and says that ranges from its boundaries now belong to it. When it dies, those ranges are ‘returned’ back to neighbours.
When new node comes into the cluster, it starts serving requests immediately. This means that every write will succeed, but reads likely wont until recovery process moves data for this node’s ID ranges to its new location. This recovery process is called ‘merge’ in Elliptics – this is basically a move of part of the data from several nodes to newly added one. This recovery process is cheap enough to be called once per hour or so.