Greylock tutorial

This tutorial covers the full-stack setup of the search engine, including Elliptics, Consul and Greylock itself. There is also a section describing simple client access for indexing and searching documents.

Installation

Elliptics and Greylock packages can be installed by following our Server setup tutorial. All Reverbrain packages can be found in our repositories.

Consul, a locking and discovery service, can be installed by following the simple steps described in its “Getting Started” guide.

We will also need Backrunner, or more precisely its bmeta tool. The steps needed to install it can be found in a small tutorial.

Configuration

Elliptics

Elliptics configuration is described in the appropriate section; in particular, there is a full-featured config file. This config sets up 2 backends, which we will use to create a new bucket.

Bucket

Here is the bucket config. It is essentially the same as the one described in the Backrunner bucket section, which you can refer to for the meaning of the parameters.

bucket_create.json
{
	"groups": [1,2],
	"acl": [
		{
			"user": "*",
			"token": "secure unused wildcard token",
			"flags": 7
		},
		{
			"user": "writer",
			"token": "secure token",
			"flags": 7
		}
	],
	"flags": 0,
	"max-size": 0,
	"max-key-num": 0
}

Note that the groups array corresponds to the group IDs specified in the backend section of the Elliptics config. When data is written into this bucket, 2 replicas are created for every key: one stored in group 1 and one in group 2.

To upload the bucket into Elliptics, use the bmeta tool:
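The link between the bucket's groups and the Elliptics backends can be checked before uploading. The following is a minimal sketch and not part of Backrunner or bmeta. It assumes the Elliptics config is JSON with a top-level "backends" array whose entries carry a "group" field (an assumption about the config layout, adjust it to your file). The function name missing_groups is hypothetical.

```python
import json


def missing_groups(bucket_cfg, elliptics_cfg):
    """Return bucket groups that no Elliptics backend serves.

    Assumes elliptics_cfg has a top-level "backends" list whose entries
    each carry a "group" field; this layout is an assumption.
    """
    backends = elliptics_cfg.get("backends", [])
    served = {b["group"] for b in backends if "group" in b}
    return sorted(set(bucket_cfg["groups"]) - served)


if __name__ == "__main__":
    # Inline stand-ins for the bucket config above and a two-backend
    # Elliptics config; replace them with json.load() on your own files.
    bucket = json.loads('{"groups": [1, 2]}')
    elliptics = json.loads('{"backends": [{"group": 1}, {"group": 2}]}')
    print("replicas per key:", len(bucket["groups"]))
    print("groups without a backend:", missing_groups(bucket, elliptics))
```

An empty result means every group listed in the bucket has at least one backend to hold its replica.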

$ cd backrunner
$ bmeta -config config.json -upload bucket_upload.json -bucket b1

Here b1 is the bucket name, bucket_upload.json is the bucket config file shown above (labelled bucket_create.json there) and config.json is the Backrunner config.

Greylock

Now that Elliptics is installed and we have created a bucket named 'b1', it's time to set up Greylock.

Here is its config, which can also be found in the source tree: https://github.com/reverbrain/greylock/blob/master/conf/server-config.json

greylock.conf
{
    "endpoints": [
        "0.0.0.0:8080"
    ],
    "backlog": 512,
    "threads": 2,
    "buffer_size": 65536,
    "logger": {
        "level": "info",
        "frontends": [
            {
                "formatter": {
                    "type": "string",
                    "pattern": "%(timestamp)s %(request_id)s/%(lwp)s/%(pid)s %(severity)s: %(message)s, %(...L)s"
                },
                "sink": {
                    "type": "files",
                    "path": "greylock.log",
                    "autoflush": true,
                    "rotation": { "move": 0 }
                }
            }
        ]
    },
    "daemon": {
        "fork": false,
        "uid": 1000
    },
    "monitor-port": 21235,
    "request_header": "X-Request",
    "trace_header": "X-Trace",
    "application": {
        "remotes": [
            "localhost:1026:2"
        ],
        "metadata-groups": [
            1, 2
        ],
        "buckets": [
            "b1"
        ],
        "meta-bucket": "b1",
        "max-page-size": 6144,
        "reserve-size": 1536
    }
}

The most interesting parts are:

  • logger - greylock.log is the actual log path
  • remotes - array of addresses where Elliptics servers listen for incoming connections; it is similar to the remote section in the Elliptics config
  • metadata-groups - array of Elliptics groups where bucket metadata is stored; these are the same groups that were specified in the Backrunner config file when we uploaded our b1 bucket above
  • buckets - array of buckets we uploaded in the previous step; in our case it is just the single bucket b1
  • meta-bucket - bucket where index metadata will live

Other options are almost never changed, so we will not describe them all here.
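Before starting the server it can help to print the application settings listed above and confirm they match what was uploaded with bmeta. This is a minimal sketch with a hypothetical helper name (summarize_application). Reading each remote as host:port:family is an assumption based on the "localhost:1026:2" value in the config.

```python
import json


def summarize_application(config):
    """Pull the application settings discussed above out of a parsed config."""
    app = config["application"]
    remotes = []
    for remote in app["remotes"]:
        # Assumed format: host:port:family. Split from the right so the
        # host part may itself contain colons.
        host, port, family = remote.rsplit(":", 2)
        remotes.append((host, int(port), int(family)))
    return {
        "remotes": remotes,
        "metadata_groups": list(app["metadata-groups"]),
        "buckets": list(app["buckets"]),
        "meta_bucket": app["meta-bucket"],
    }


if __name__ == "__main__":
    # Inline copy of the "application" section from greylock.conf above;
    # replace it with json.load() on the real file.
    sample = json.loads("""
    {"application": {"remotes": ["localhost:1026:2"],
                     "metadata-groups": [1, 2],
                     "buckets": ["b1"],
                     "meta-bucket": "b1"}}
    """)
    for name, value in summarize_application(sample).items():
        print(name, "=", value)
```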

Consul

Now that Greylock is configured, it's time to set up the Consul discovery and locking service. Consul is not required by the search engine itself, but our example code uses it to lock indexes, since Greylock does not provide locking among different users. If you will only use a single Greylock HTTP server, this section can be skipped. If multiple Greylock HTTP servers are started, the client has to lock the index in some distributed consensus engine such as Consul (Zookeeper or etcd would also work) and only then update the indexes.

Greylock contains all the needed Consul config files. For more details on Consul options, check their Getting Started Guide.

Starting services

Elliptics

# dnet_ioserv -c ioserv.conf

Greylock

# greylock_server -c greylock.conf

Consul

We will only start a single-node Consul agent, although this is not very safe. For details on how to set up a highly available cluster, check out the Consul documentation.

# cd greylock
# mkdir /tmp/consul
# consul agent -server -bootstrap-expect 1 -config-dir conf/consul.d/ -data-dir /tmp/consul/

Running queries

These examples do not require Consul; this is a simple single-server setup.

$ cd greylock
$ curl -d @conf/insert.json http://localhost:8080/index
$ curl -d @conf/search.json http://localhost:8080/search
{
    "ids": [
        {
            "key": "document elliptics key",
            "bucket": "document elliptics bucket",
            "id": "document ID",
            "relevance": 0.5,
            "timestamp": {
                "tsec": 1440696489,
                "tnsec": 1234
            }
        }
    ],
    "completed": true,
    "paging": {
        "num": 1,
        "start": ""
    }
}

If you check conf/insert.json, you will find that it indexes all fields found in the index section. Every field in the index object is split into key/value pairs, where 'key' is the attribute name (attribute key, another key, text key and so on) and 'value' is the text string split into single-word tokens (value, another and so on). When searching, one has to specify attribute names and values. Search and indexing are performed within the mailbox namespace.

When the search completes, it returns an array of ids, where each object corresponds to a document (described by the key, bucket and id tuple; compare them to the corresponding fields in conf/insert.json) in which all required keys and values are present.
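The same requests can be issued from Python instead of curl. The sketch below uses only the standard library. The helper names (build_post, send, parse_search_reply) are hypothetical, and treating tsec as Unix seconds in UTC is an assumption based on the sample reply. It posts a file body to the /index or /search handler and unpacks the reply format shown above.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_post(base_url, handler, body):
    """Build a POST request similar to `curl -d @file base_url/handler`."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/" + handler,
        data=body,
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded response body."""
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")


def parse_search_reply(text):
    """Turn a /search reply into (completed, paging, list of documents)."""
    reply = json.loads(text)
    docs = []
    for item in reply["ids"]:
        ts = item["timestamp"]
        docs.append({
            "key": item["key"],
            "bucket": item["bucket"],
            "id": item["id"],
            "relevance": item["relevance"],
            # Assumption: tsec is Unix time in seconds; tnsec is kept as-is.
            "time": datetime.fromtimestamp(ts["tsec"], tz=timezone.utc),
            "tnsec": ts["tnsec"],
        })
    return reply["completed"], reply["paging"], docs


# Usage against the server started above (run from the greylock directory):
#
#   with open("conf/insert.json", "rb") as f:
#       send(build_post("http://localhost:8080", "index", f.read()))
#   with open("conf/search.json", "rb") as f:
#       reply = send(build_post("http://localhost:8080", "search", f.read()))
#   completed, paging, docs = parse_search_reply(reply)
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before anything reaches the server.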

All queries are case sensitive, so Search and search are different terms.
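To make the key/value splitting and the case-sensitivity rule concrete, here is a small illustrative model, not Greylock's actual tokenizer. Splitting on whitespace is an assumption made for this sketch, and the function names are hypothetical. A document matches only when every requested attribute contains every requested word with exactly the same case.

```python
def tokenize_index(index):
    """Split each attribute's text into single-word tokens.

    Whitespace splitting is an assumption made for illustration; the
    server's real tokenizer may differ. Case is preserved on purpose.
    """
    return {attr: text.split() for attr, text in index.items()}


def matches(tokens, query):
    """True if every queried attribute holds every queried word (exact case)."""
    return all(
        word in tokens.get(attr, [])
        for attr, text in query.items()
        for word in text.split()
    )


if __name__ == "__main__":
    doc = tokenize_index({"text key": "Search engines index text"})
    print(doc)
    print(matches(doc, {"text key": "Search"}))  # same case as indexed
    print(matches(doc, {"text key": "search"}))  # different case
```

In this model the first query matches and the second does not, which mirrors the rule that Search and search are different terms.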

Running client code with Consul

$ cd greylock
$ python src/consul_client.py --consul-url http://localhost:8500 --mailbox some-mailbox-name --id some-document-id --file conf/insert.json
All indexes for document 'some-document-id' have been successfully updated
$ python src/consul_client.py --consul-url http://localhost:8500 --mailbox some-mailbox-name --search "searching"
completed: True
paging: num: 1, start: ''
bucket: '', key: '', id: 'some-document-id', relevance: 1.000000, ts: 2015-09-19 02:41:49.082503127

The client src/consul_client.py locks the whole mailbox while waiting for all indexes to be updated and unlocks it afterwards. There is an API in src/consul.py to wait on a lock, continue it (all locks are actually leases, taken for a limited time) or break it.
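The lease-style locking can be sketched directly against Consul's HTTP API (session create, KV acquire/release, session renew and destroy). This is not the code from src/consul.py, whose interface is not reproduced here. The class name MailboxLock, the key prefix greylock/locks/ and the 30s TTL are hypothetical choices for illustration, and breaking a lock held by another client is left out.

```python
import json
import urllib.request
from urllib.parse import quote


def http_put(url, body=b""):
    """Default transport: PUT the body and return the decoded JSON reply."""
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


class MailboxLock:
    """Lease-style lock on one mailbox, backed by a Consul session and KV key."""

    def __init__(self, consul_url, mailbox, ttl="30s", put=http_put):
        self.base = consul_url.rstrip("/")
        # The key prefix is a hypothetical naming choice for this sketch.
        self.key = "greylock/locks/" + quote(mailbox, safe="")
        self.ttl = ttl
        self.put = put  # injectable so the class can be exercised offline
        self.session = None

    def acquire(self):
        """Create a session with a TTL and try to take the key."""
        body = json.dumps({"Name": self.key, "TTL": self.ttl}).encode()
        self.session = self.put(self.base + "/v1/session/create", body)["ID"]
        url = "%s/v1/kv/%s?acquire=%s" % (self.base, self.key, self.session)
        if self.put(url, b"locked"):
            return True
        # Somebody else holds the key: drop our unused session.
        self.put(self.base + "/v1/session/destroy/" + self.session)
        self.session = None
        return False

    def renew(self):
        """Extend the lease so a long index update does not lose the lock."""
        self.put(self.base + "/v1/session/renew/" + self.session)

    def release(self):
        """Give the key back and destroy the session."""
        url = "%s/v1/kv/%s?release=%s" % (self.base, self.key, self.session)
        self.put(url)
        self.put(self.base + "/v1/session/destroy/" + self.session)
        self.session = None


# Usage: hold the lock only around the index update.
#
#   lock = MailboxLock("http://localhost:8500", "some-mailbox-name")
#   if lock.acquire():
#       try:
#           ...  # POST the document to Greylock's /index handler here
#       finally:
#           lock.release()
```

Because the lock is tied to a session with a TTL, a client that crashes mid-update stops renewing and the lease eventually expires instead of blocking other writers forever.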

greylock/tutorial.txt · Last modified: 2015/09/19 02:52 by zbr