Roi Amitay is the DevOps Team Leader at Matomy.
Matomy’s new DSP monitoring system is up, running, and giving the best performance possible.
After deploying a new Grafana server using the InfluxDB database as the backend, our new monitoring solution offers the following key features and benefits:
- More accurate data due to the new statsD C-based replacement
- Improved performance due to the new InfluxDB backend
- Using the Graphite query language means no changes are required for the dashboards
- Better security using LDAP authentication
- New Grafana 4 interface (including the new alert feature)
- HA with the ability to choose between backends inside the Grafana interface
- Big reduction in cost as we now use smaller instances, fewer and cheaper disks
- Annotations for deployments and business changes
It took a little time getting here, as we first had to upgrade the monitoring solution. Read all about the process below.
Two months ago, we started having issues with Grafana, or more accurately, with the whisper backend, something we weren’t aware of back then.
It was slow and inefficient. The system crashed often, and because we had no high availability or backups for the data at the time, every crash meant lost data.
We had to do something to resolve these issues.
The Old System
The initial system was based on a single node: a 16-core instance running on AWS. All required services ran on that single node:
- statsD (https://github.com/etsy/statsd) – A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, over UDP or TCP, and sends aggregates to one or more pluggable backend services.
- carbon-cache (https://github.com/graphite-project/carbon) – Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
- whisper (https://github.com/graphite-project/whisper) – Whisper is a file-based time-series database format for Graphite.
- graphite web (https://github.com/graphite-project/graphite-web) – A highly scalable, real-time graphing system.
- grafana web (https://github.com/grafana/grafana) – Grafana is an open source, feature-rich metrics dashboard and graph editor.
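To make the statsD item above concrete, here is a minimal sketch of what an application sends to a statsD-compatible daemon: one short text line per measurement, usually over UDP. The metric names, host, and port 8125 (the conventional statsD port) are assumptions for illustration, not values from our setup.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one statsD line: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single statsD line over UDP (no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
counter = format_metric("dsp.requests", 1, "c")       # counter increment
timer = format_metric("dsp.bid_latency", 320, "ms")   # timer, in milliseconds

if __name__ == "__main__":
    send_metric(counter)
    send_metric(timer)
```

Because UDP gives no acknowledgement, an overloaded daemon drops lines without the sender noticing, which is why lost metrics can go undetected for a long time.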
First, we tried to make the best of the old system already running, something that we now understand was not possible due to the heavy traffic of metrics and concurrent users.
We changed the instance type and replaced the disks with io1 volumes, which provide better I/O.
The Upgrade Process
We wanted to be able to work on the new solution using the production metrics while not causing any disruptions on the old production system.
To do that, we decided to move the statsD service to an external instance and add a carbon-relay that would deliver the metrics to two destinations. While doing so, we saw that the statsD service, which runs on the Node.js platform, was not performing well and that many metrics were lost in the process.
While looking for improved performance, we encountered brubeck as a statsD alternative (https://github.com/github/brubeck).
Brubeck is a statsd-compatible stats aggregator written in C, and promises better performance.
The data then needed to be relayed to each of the carbon-cache daemons (the old one and the new one, which we needed to set up as part of the new solution). For the relay task we first used the default carbon-relay tool, distributed as part of the Graphite project. This carbon-relay also caused performance issues and crashes, so we searched for an alternative. The one we found is called carbon-c-relay (https://github.com/grobian/carbon-c-relay), an enhanced C implementation of the carbon relay, aggregator and rewriter. It performs better with high volumes of traffic, at the expense of some features. One of the missing features was support for Graphite's Pickle protocol, which is used to send metrics in batches instead of line by line.
To deal with the missing Pickle support, we added another component to the system: the bucky-pickle-relay (https://github.com/jjneely/buckytools), a daemon that accepts and decodes Graphite’s Pickle protocol from the brubeck service and streams the equivalent plaintext version over TCP to a carbon-c-relay. With that, we could receive data and send it to multiple destinations.
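The difference between the two Graphite wire formats can be sketched in a few lines. This is only an illustration of the translation step, not the bucky-pickle-relay implementation: a Pickle frame is a 4-byte big-endian length header followed by a pickled list of (path, (timestamp, value)) tuples, while the plaintext protocol is one "path value timestamp" line per metric. The metric names and values are made up.

```python
import pickle
import struct


def encode_pickle_batch(metrics):
    """Frame a batch the way Graphite's Pickle protocol expects:
    a 4-byte big-endian length header followed by the pickled list."""
    payload = pickle.dumps(metrics, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def decode_to_plaintext(frame):
    """Turn one Pickle frame into plaintext lines: '<path> <value> <timestamp>'.

    Note: unpickling is only safe for data from trusted senders.
    """
    (size,) = struct.unpack("!L", frame[:4])
    batch = pickle.loads(frame[4:4 + size])
    return [f"{path} {value} {timestamp}" for path, (timestamp, value) in batch]


# Hypothetical batch: a list of (path, (timestamp, value)) tuples.
batch = [
    ("dsp.requests.count", (1480000000, 42)),
    ("dsp.bid_latency.mean", (1480000000, 12.5)),
]
lines = decode_to_plaintext(encode_pickle_batch(batch))
```

Each resulting line can be written, followed by a newline, to any listener that speaks the plaintext protocol.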
At this point, we were already receiving more accurate metrics thanks to the new C-compiled services. However, this didn’t solve the performance and availability issues, so we decided to create a Graphite cluster. The idea was that by splitting the data between nodes, we would get better performance out of the whisper database, since each query would touch less data.
The Graphite cluster consists of two (or more) nodes and a relay above them. Each node runs a number of carbon-cache daemons, depending on the number of cores (we used 8). Each node receives only part of the metrics and stores them locally in its whisper database. The carbon-c-relay splits the data between the nodes using the chosen algorithm. We first used the carbon_ch method, which sends each metric to the member that is responsible for it according to the consistent hash algorithm (as used in the original carbon), or to multiple members if replication is set to more than 1.
Another option is the fnv1a_ch algorithm, which behaves identically to carbon_ch but uses a different hash technique (FNV1a). It is faster and, more importantly, it works around a limitation of carbon_ch by using both the host and the port of each member when building the hash ring.
To query data from all nodes, graphite-web collects the data from the other graphite-web instances running on each node.
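The routing idea behind both methods can be sketched as a small consistent-hash ring. This is a simplified illustration and not carbon-c-relay's actual ring layout: the FNV-1a function is the standard 32-bit variant, but the way members are placed on the ring (100 points per member, keyed by "host:port-index") and the member addresses are assumptions.

```python
import bisect

FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


class HashRing:
    """Toy consistent-hash ring: each member owns many points on the ring."""

    def __init__(self, members, points_per_member=100):
        self._ring = sorted(
            (fnv1a_32(f"{member}-{i}".encode()), member)
            for member in members
            for i in range(points_per_member)
        )
        self._keys = [point for point, _ in self._ring]

    def node_for(self, metric: str) -> str:
        """Return the member owning the first ring point after the metric's hash."""
        idx = bisect.bisect(self._keys, fnv1a_32(metric.encode())) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical members, identified by host AND port.
ring = HashRing(["10.0.0.1:2003", "10.0.0.2:2003"])
owner = ring.node_for("dsp.requests.count")
```

The useful property is that adding a member moves only the metrics that land on the new member's points; everything else stays where it was, so existing whisper files remain valid on their nodes.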
The new cluster used a new Grafana instance.
Running with the cluster solution resulted in better performance but at a big cost. Data could be queried very fast using 4 nodes, but when we began to invite users to try the new system we suddenly encountered another big issue: the number of concurrent users.
It seemed that Graphite was unable to support a large number of concurrent users. When we tried adding memcached to cache the renders and the metrics, we saw many cache misses while running the cluster.
By the time the cluster project ended, we had learned a few things:
- whisper as a backend is the root of our problem.
- The concurrent-user issue must be addressed in the next solution (using memcached).
- We need to think about the cost of the next solution, as it's beginning to be a big factor when scaling up.
InfluxDB (https://github.com/influxdata/influxdb), a scalable datastore for metrics, events, and real-time analytics, provided the best solution to our problem.
The big problem with InfluxDB is that it required us to replace all dashboards, which can be difficult to do in some cases, and impossible in others. InfluxDB does not yet support “Join” operations between measurements and that makes some of the queries impossible.
influxgraph (https://github.com/InfluxGraph/influxgraph), a DB storage plugin for Graphite-API that was developed by the team at InfluxDB, came to the rescue. Adding memcached support and packaging everything in a single Docker container helped us manage access to the InfluxDB database, while translating both the metrics coming into the database and the queries going to it.
In simple terms, this plugin translates the familiar Graphite queries into InfluxDB queries, so we could keep the current dashboards.
The only thing left was to add retention and aggregation for the data going into InfluxDB.
One of InfluxDB’s features is retention policies. Raw data flows into InfluxDB and lands in the raw retention policy (which is configured to keep data for only 1 hour). Continuous queries run inside InfluxDB and aggregate the data in the raw retention policy. In our case, data is first aggregated into 60-second intervals using the “Median” function and stored in another retention policy, daily. This policy keeps the aggregated data for one day, and so on, according to the aggregation schema:
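From the dashboard's point of view nothing changes: it still issues a Graphite render request, and the plugin answers it from InfluxDB. A minimal sketch of such a request is shown here; the host name, port, and target expression are hypothetical, while /render with the target, from, and format parameters is the standard Graphite render API.

```python
from urllib.parse import urlencode


def render_url(base_url, target, time_from="-1h", fmt="json"):
    """Build a Graphite render API URL for a single target expression."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url}/render?{query}"


# Hypothetical endpoint and metric names, for illustration only.
url = render_url(
    "http://graphite-api.example:8000",
    "sumSeries(dsp.*.requests.count)",
    time_from="-24h",
)
```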
1 day -> 60 sec intervals
1 week -> 1 hour intervals
1 month -> 1 day intervals
1 year -> 1 week intervals
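The schema above could be expressed as InfluxQL statements along the following lines. This is a sketch under several assumptions: the database name "graphite", the field name "value", the policy names other than raw and daily, a 30-day month, and the choice to chain each level from the previous one are all illustrative and not taken from our actual configuration.

```python
# (policy name, how long data is kept, aggregation interval, source policy)
SCHEMA = [
    ("daily", "1d", "60s", "raw"),
    ("weekly", "1w", "1h", "daily"),
    ("monthly", "30d", "1d", "weekly"),    # assumes a 30-day month
    ("yearly", "365d", "1w", "monthly"),
]


def retention_policy(db, name, duration, default=False):
    """InfluxQL statement that creates one retention policy."""
    stmt = f'CREATE RETENTION POLICY "{name}" ON "{db}" DURATION {duration} REPLICATION 1'
    return stmt + (" DEFAULT" if default else "")


def continuous_query(db, name, interval, source):
    """InfluxQL continuous query that downsamples every measurement
    from the source policy into the target policy using the median."""
    return (
        f'CREATE CONTINUOUS QUERY "cq_{name}" ON "{db}" BEGIN '
        f'SELECT median("value") AS "value" '
        f'INTO "{db}"."{name}".:MEASUREMENT '
        f'FROM "{db}"."{source}"./.*/ '
        f"GROUP BY time({interval}), * END"
    )


# Raw data lands in the default policy and is kept for one hour.
statements = [retention_policy("graphite", "raw", "1h", default=True)]
for name, duration, interval, source in SCHEMA:
    statements.append(retention_policy("graphite", name, duration))
    statements.append(continuous_query("graphite", name, interval, source))
```

Printing the statements list gives the commands that would be run once against InfluxDB; after that, the continuous queries keep each level populated automatically.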
influxgraph enables us to direct queries to a specific retention policy according to the time interval. This way, Grafana queries the retention policy that matches the requested time interval.
The new system has now been running for a month with only a few small issues, mostly related to fine-tuning the Docker daemon. In any case, data is stored on at least one of the datastores (InfluxDB or Graphite). The system can handle a greater number of concurrent users, the data is more accurate, and metrics are stored for 1 year.
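The selection logic can be pictured as a simple lookup on how far back a query reaches. The function below is a hypothetical illustration of the idea, not influxgraph's code or configuration format; the policy names beyond raw and daily and the 30-day month are assumptions.

```python
HOUR = 3600
DAY = 24 * HOUR

# (maximum look-back in seconds, retention policy that still holds that data)
POLICIES = [
    (DAY, "daily"),         # 60-second points, kept for one day
    (7 * DAY, "weekly"),    # 1-hour points, kept for one week
    (30 * DAY, "monthly"),  # 1-day points, kept for one month (assumed 30 days)
    (365 * DAY, "yearly"),  # 1-week points, kept for one year
]


def policy_for(lookback_seconds):
    """Pick the finest-grained policy that still covers the query's start time."""
    for max_lookback, policy in POLICIES:
        if lookback_seconds <= max_lookback:
            return policy
    return POLICIES[-1][1]  # older than everything we keep: use the coarsest


six_hours = policy_for(6 * HOUR)
three_days = policy_for(3 * DAY)
```

A dashboard showing the last six hours reads 60-second points, while the same dashboard zoomed out to three days reads hourly points, so the number of points per query stays manageable.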
As with all big projects, there is still work to be done, such as monitoring the system and creating better backups for the configurations.
We also noted that the way Graphite-API stores the data is by using separate measurements. This can cause a problem when using hundreds of measurements. We are still optimizing the Graphite-API in order to decrease the number of unique measurements (using Influx templates) and speed up the system.
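To show why templates help, here is a simplified sketch of how a Graphite-style template can map a dotted metric path onto one measurement plus tags. It covers only tag names, "measurement", and the trailing "measurement*" wildcard; real templates support more (filters, fields, default tags). The template and metric paths are hypothetical.

```python
def apply_template(template, metric_path):
    """Map a dotted Graphite path to (measurement, tags) using a template.

    Each template segment names what the path segment at the same
    position means; 'measurement*' consumes all remaining segments.
    """
    names = template.split(".")
    parts = metric_path.split(".")
    tags, measurement = {}, []
    for i, name in enumerate(names):
        if name == "measurement*":
            measurement.extend(parts[i:])
            break
        if i >= len(parts):
            break
        if name == "measurement":
            measurement.append(parts[i])
        elif name:  # an empty segment means "skip this part of the path"
            tags[name] = parts[i]
    return ".".join(measurement), tags


# Hypothetical template and metric paths.
first = apply_template("env.host.measurement*", "prod.web01.cpu.load")
second = apply_template("env.host.measurement*", "prod.web02.cpu.load")
```

Without a template, each of these paths would be stored as its own measurement; with it, both collapse into the single measurement cpu.load and differ only in their host tag.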