A scalable way to build a search engine.
Greylock is a rather simple C++ search engine that takes the burden of engine configuration off your shoulders, yet scales from hundreds to billions of indexed documents and from a single server to hundreds of them, with as little operational intervention as possible.
Facing a constant or increasing rate of input documents, we at Reverbrain wanted to spend as little effort as possible on search engine resharding, moving data around, constantly watching shard load, and so on. We just wanted to add new servers when needed, optionally running mundane operational tasks in the background (ideally started automatically by agents).
Reverbrain had already developed Elliptics, a distributed storage system. It was originally built on a distributed hash table, but we introduced logical entities named buckets to wrap replication into a structure that can be easily operated on. Our main goal for Elliptics was to create a really safe storage system that keeps replicas across physically distributed datacenters, and with buckets it can be fine-tuned for different types of data. Buckets allow horizontal scaling in environments with a constant or increasing rate of input data and an ever-increasing storage size: when new buckets are added for new servers, there is no resharding or consistent-hashing rebalancing, although it is also possible to scale a single bucket (a set of replicas) by adding servers to it. Elliptics has proved to handle scaling easily: in 2013 one of our installations held 53+ billion records; for comparison, Facebook had 450+ billion photos in those days.
With this background we decided to build the search engine on top of Elliptics, which took all scaling, replication, distribution and recovery tasks off our shoulders. We concentrated on core search engine features, and this is what we have:
Greylock is a quite simple search engine; we created it to fill the scalability niche. Because of this it has no embedded natural language processing such as lemmatization (or simpler stemming), spelling error correction, synonym search and so on. Instead we build a microservice architecture where NLP tasks are separated into their own service, which we will release soon as well.
The conf/ directory contains, among other things, the greylock_server HTTP server config and insert/select JSON files, which are examples of the corresponding operations over HTTP; these files cover all supported features. Searching and indexing over the HTTP API go through their respective URLs, such as /index. The host and port where the greylock_server HTTP server listens for incoming connections are specified in its config in the endpoints section. The most vital part, the Elliptics connection and buckets, lives in the application section, which will be described in detail in the documentation.
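To make the two sections mentioned above concrete, here is a minimal sketch of how such a config might be laid out. The section names endpoints and application come from the text; every key inside them, the port number, and the bucket and host names are illustrative assumptions, not the actual greylock_server schema (see the files in conf/ for the real format):

```json
{
    "endpoints": [
        "0.0.0.0:8080"
    ],
    "application": {
        "remotes": [
            "elliptics-node1.example.com:1025"
        ],
        "buckets": [
            "bucket.1",
            "bucket.2"
        ]
    }
}
```

The endpoints array tells the HTTP server where to listen; the application section points it at the Elliptics cluster and the buckets used for index data.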
The Greylock tutorial covers the whole-stack configuration and installation process: installing Elliptics, setting up buckets and the greylock HTTP server, and using a simple example client with Consul locking and service discovery support.
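As a toy illustration of what a simple client talking to the HTTP API might look like, the sketch below builds an insert request for the /index URL. The payload field names, the port, and the exact URL path are assumptions for illustration only; the authoritative request format is in the insert/select JSON examples shipped in conf/:

```python
import json
from urllib import request

def make_insert_request(base_url, index, doc_id, text):
    """Build (but do not send) a hypothetical document-insert request.

    The payload schema here is an assumption; the real one is defined by
    the insert JSON examples in the greylock conf/ directory.
    """
    payload = {
        "index": index,
        "id": doc_id,
        "content": text,
    }
    body = json.dumps(payload).encode("utf-8")
    # POST to the /index URL exposed by greylock_server.
    return request.Request(base_url + "/index", data=body,
                           headers={"Content-Type": "application/json"})

req = make_insert_request("http://localhost:8080", "articles", "doc-1",
                          "distributed search on top of Elliptics")
print(req.full_url)                 # http://localhost:8080/index
print(json.loads(req.data)["id"])   # doc-1
```

Sending the request (for example with `urllib.request.urlopen(req)`) would require a running greylock_server; the sketch only shows how the JSON body and target URL fit together.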