Monitoring

This document gives an external overview of Elliptics server monitoring: how to request statistics and how to read the received statistics.

For developer details, please read the monitoring inside article.

Introduction

The monitoring subsystem was added to the Elliptics server to track the state of the server and to determine how fast it handles commands or why it is stuck on some command. Monitoring statistics are available via a simple REST API on the monitor port. Statistics can be requested for one of the existing categories:

  • ALL - all available statistics, collected from all categories;
  • CACHE - internal cache statistics;
  • IO_QUEUE - statistics of queues: input of client requests and output of replies;
  • COMMANDS - handled commands statistics: summary properties of all commands and the history of handled commands with their parameters;
  • IO_HISTOGRAMS - 2D histograms for handled READ, WRITE, INDEX_UPDATE and INDEX_UPDATE_INTERNAL commands.

For details on the monitoring categories, see the statistics data format below.

Monitoring configuration

To turn monitoring on, the “monitor_port” parameter should be specified in the dnet_ioserv config file. If it isn't specified or is equal to zero, the monitoring subsystem will not be created and initialised. If it is specified, monitoring will be initialised to listen for incoming connections on “monitor_port”.
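
For example, a config could enable monitoring like this (a minimal sketch: the port value is arbitrary, all other options are omitted, and the exact syntax should be checked against your dnet_ioserv config):

monitor_port = 20000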

Simple REST API

Monitoring includes a simple implementation of a REST API over HTTP 1.0 for incoming connections. It receives a packet, tries to grab the URI from the first line and matches it against one of the existing categories or against a request for the list of existing categories. If the first line of the request is badly formatted, monitoring replies with an HTTP 400 response. If the request contains a URI of an unknown category, monitoring replies with an HTTP 404 response. Otherwise monitoring replies with an HTTP 200 response containing the statistics JSON or the list of existing categories.

It is important to understand that the monitoring HTTP server is very simple: it doesn't validate the whole packet and doesn't support any headers, it just takes the first line, skips “GET” and parses the URI.

Any command-line tool, such as curl or wget, can be used to get statistics from monitoring:

$ curl hostname:20000/list
$ curl hostname:20000/all
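
The JSON can be piped through any pretty-printer, and the status code behaviour described above can be checked with curl directly. This is just an illustration: it assumes jq is installed, the node answers on port 20000, and “unknown” is a made-up category name:

$ curl -s hostname:20000/all | jq .
$ curl -s -o /dev/null -w '%{http_code}\n' hostname:20000/unknown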

JSON statistics format

All responses to monitoring statistics requests are HTTP 200 responses with a JSON body. The root JSON element has a 'time' field and one field per requested category. The 'time' field contains the time that has passed since the previous statistics request.
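
For example, a response to a request for the CACHE category alone could be shaped like this (an illustrative sketch of the structure described above, not a verbatim server reply):

{
    "time": ..., // time passed since the previous statistics request
    "cache": { ... } // one element per requested category
}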

ALL statistics format

ALL statistics is the combined statistics of all categories. It looks like:

{
"cache": {...}, // CACHE statistics
"io_queue_stat": {...}, // IO_QUEUE statistics
"commands_stat": {...}, // 'commands_stat' part of COMMANDS statistics
"history_stat": {...}, // 'history_stat' part of COMMANDS statistics
"histogram": {...} // IO_HISTOGRAMS statistics
}

CACHE statistics format

CACHE statistics include various properties describing how the cache is working. Its element is named 'cache' and looks like:

"cache": {
    "number_of_objects": 0, // number of objects in cache
    "size_of_objects": 0, // size of all objects in cache
    "number_of_objects_marked_for_deletion": 0, // number of objects marked for deletion which would be deleted soon
    "size_of_objects_marked_for_deletion": 0, // size of objects marked for deletion
    "total_lifecheck_time": 1664569,
    "total_write_time": 0,
    "total_read_time": 0,
    "total_remove_time": 0,
    "total_lookup_time": 0,
    "total_resize_time": 0,
    "caches": [ // per cache statistics, be default cache consists of 16 independent caches
        { // statistics of first of 16 independent cache
            "number_of_objects": 0,    // number of objects in the independent cache
            "size_of_objects": 0, // size of objects in the independent cache
            "number_of_objects_marked_for_deletion": 0, // number of objects marked for deletion in the independent cache
            "size_of_objects_marked_for_deletion": 0, // size of objects marked for deletion in the independent cache
            "total_lifecheck_time": 107405,
            "total_write_time": 0,
            "total_read_time": 0,
            "total_remove_time": 0,
            "total_lookup_time": 0,
            "total_resize_time": 0
        },
        ... // the same statistics for the other 15 independent caches.
    ]
}
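
For a quick look at overall cache occupancy, the relevant fields can be pulled out with jq (assuming the node answers on port 20000 and jq is installed):

$ curl -s hostname:20000/all | jq '.cache.number_of_objects, .cache.size_of_objects'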

IO_QUEUE statistics format

IO_QUEUE statistics include information about the input queue (later, output queue statistics will be provided too) and how many packets were queued before handling. Its element is named 'io_queue_stat' and looks like:

"io_queue_stat": {
    "size": 0, // current size of input io queue
    "volume": 0, //
    "min": 0, // minimum size of input io queue from the last io queue statistics request
    "max": 0, // maximum size of input io queue from the last io queue statistics request
    "time": 1390827940905268 // time from the last io queue statistics request
}
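
The high-water mark of the input queue since the previous request can be read directly (same assumptions as above: port 20000 and jq installed):

$ curl -s hostname:20000/all | jq '.io_queue_stat.max'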

COMMANDS statistics format

COMMANDS statistics consist of two elements: 'commands_stat' and 'history_stat'.

'commands_stat' is per-command statistics and answers how many times the command succeeded or failed while being handled by the backend or the cache, and how many of those commands were created by clients. It also contains the size of the data that took part in the command handling and how much time it took.

'history_stat' is a chronological list of handled commands with the properties of their handling: the size of the data, the time of handling, the subsystem which satisfied the command and whether it was created by a client.

'commands_stat' and 'history_stat' look like:

"commands_stat": {
    "COMMAND": { // COMMAND statistics where COMMAND is on of elliptics commands, for example: REVERSE_LOOKUP, JOIN, WRITE, READ, CHECK etc.
        "cache": { // statistics of COMMANDs which was satisfied by cache subsystem
            "successes": 0, // number of COMMANDs which was succeeded while handling it by cache
            "failures": 0 // number of COMMANDs which was failed while handling it by cache
        },
        "cache_internal": { // number of COMMANDS which was satisfied by cache subsystem, but wasn't created by clients
            "successes": 0, // number of COMMANDs which was succeeded while handling it by cache
            "failures": 0 // number of COMMANDs which was failed while handling it by cache
        },
        "disk": { // statistics of COMMANDS which was satisfied by backend (eblob or other)
            "successes": 0, //  number of COMMANDs which was succeeded while handling it by backend
            "failures": 0 // number of COMMANDs which was failed while handling it by backend
        },
        "disk_internal": { // statistics of COMMANDS which was satisfied by backend (eblob or other), but wasn't created by clients
            "successes": 0, //  number of COMMANDs which was succeeded while handling it by backend
            "failures": 0 // number of COMMANDs which was failed while handling it by backend
        },
        "cache_size": 0, // size of data which was read, write or just processed by cache while handling the COMMAND
        "cache_intenal_size": 0, // size of data which was read, write or just processed by cache while handling the COMMAND which wasn't created by client
        "disk_size": 0, // size of data which was read, write or just processed by backend while handling the COMMAND
        "disk_internal_size": 0, // size of data which was read, write or just processed by backend while handling the COMMAND which wasn't created by client
        "cache_time": 0, // time which has been spent by cache while handling the COMMAND
        "cache_internal_time": 0, // time which has been spent by cache while handling the COMMAND which wasn't created by client
        "disk_time": 0, // time which has been spent by backend while handling the COMMAND
        "disk_internal_time": 0 // time which was spent by backend while handling the COMMAND which wasn't created by client
    },
    ... // the same statistics for all existing commands
},
"history_stat": [ // history of handled commands
    {
        "COMMAND": { // COMMAND is the name of executed command
            "internal": "true", // is it created by server itself or by client
            "cache": "false", // was it satisfied by cache
            "size": 0, // size of data which was included in COMMAND handling
            "time": 150 // time which was spent to handle the COMMAND
        }
    },
    ... // the same statistics for each executed command in chronological order
]
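
For example, the disk success/failure counters for WRITE commands can be extracted like this (assuming jq is installed and WRITE statistics are present on the node):

$ curl -s hostname:20000/all | jq '.commands_stat.WRITE.disk'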

IO_HISTOGRAMS statistics format

IO_HISTOGRAMS statistics contain 2D histograms of handled read, write, index_update and index_update_internal commands. Each command histogram contains 5 snapshots, each of which covers a 1 second period from the last 5 seconds, plus one extra snapshot which has been filled since the last IO_HISTOGRAMS statistics request. Each snapshot is a 2D histogram with size on one dimension and time on the other. The size dimension is how much data took part in the command handling; the time dimension is the time spent handling the command. A histogram cell (X, Y) contains a counter of how many commands were handled with size in range X and time in range Y. For example, a read of 300 bytes that took 2000 usecs is counted in the cell for sizes between 100 and 500 bytes and times between 500 and 5000 usecs.

The IO_HISTOGRAMS element is named 'histogram' and looks like:

"histogram": {
    "read": { // the name of commands which is the owner of histograms
        "disk": { // the subsystem which satisfies commands of histograms and who created the commands of histograms. 'disk' - commands which was satisfied by backend and was created by client, 'cache' - commands which was satisfied by cache and was created by client, 'disk_internal' - commands which was satisfied by backend and wasn't created by client, 'cache_internal' - commands which was satisfied by cache and wasn't created by client.
            "snapshots": [ // one of 5 snapshot which has one second accumulative 2D histograms
                {
                    "<100 bytes": { // statistics of reads which has data with size less then 100 bytes
                        "<500 usecs": 0, // number of reads which was handled faster then 500 usecs
                        "<5000 usecs": 0, // number of reads which was handled faster then 5000 uses but longer then 500 usecs
                        "<100000 usecs": 0, // number of reads which was handled faster then 100000 uses but longer then 5000 usecs
                        ">100000 usecs": 0 // number of reads which was handled longer then 100000 uses
                    },
                    ... // the same statistics for data with size between 100-500 bytes, 500-1000 bytes and more then 1000 bytes.
                    "time": { // timestamp of snapshot creation
                        "tv_sec": 1390829223,
                        "tv_usec": 260364
                    }
                },
                ... // other 4 snapshots
            ],
            "last_snapshot": { // last snapshot which was filled from last histograms statistics request
                ... the same statistics like in "snapshot"
            }
        },
        ... // the same statistics as 'disk' for 'cache', 'disk_internal' and 'cache_internal'
    },
    ... // the same statistics as 'read' for 'write', 'indx_update' and 'indx_internal'
}
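
The most recent accumulated snapshot for client reads served from disk can be pulled out the same way (assuming jq is installed and the node answers on port 20000):

$ curl -s hostname:20000/all | jq '.histogram.read.disk.last_snapshot'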