This is that Time of the Year when you start preparing all your SLA Reports to understand how your important Services behaved during the Year itself. It is like the end of a Horse Race, when you find out whether the bets you placed were right or not. And I don’t like betting too much: if you manage a Service that is critical or strategic to your Company, you should check its reliability throughout the Year to understand whether you have to take action to improve it. So I was wondering: how can I do that?
Since NetEye comes with the Icingaweb2 Reporting Module out of the box, a very basic idea is to run and analyze SLA Reports more frequently: two or three times per Year (in addition to the Final one) might help you understand what happened. But this will always be a Review, allowing you to see only what has already happened. Running SLA Reports more frequently, with a shorter time frame, can help identify subtle Service interruptions before they significantly impact your SLA; it can also help you plan Maintenance without degrading the End User Experience too much: my Service is acting up this month, so I should postpone a low-priority maintenance, or the opposite (bring it forward if everything is going well). This is starting to be more Prevention than Review, and I like it.
To do so, the most naive idea is to run and analyze all SLA Reports more often, but this is still a Review, and a time-consuming activity. I need something easier to look at and understand. Maybe an ITOA Dashboard is a better idea: sure, a Timeline with some green/red marks is better than a report with a list of outages, but querying the Icinga Event History for all events every time a Dashboard refreshes is really expensive for a System, especially as it grows larger. Sadly, MariaDB and the Icinga2 IDO Data Model are not really made for this kind of work, and Grafana is indifferent to this matter. The risk is overloading the Database Backend, resulting in a poor User Experience for all NetEye Users. Nevertheless, this is the right path to walk.
Then I began to look for alternative ways, maybe using InfluxDB as the backend: InfluxDB is a better choice for this kind of work, and together with Grafana it makes a great couple. Then I suddenly remembered that Icinga2 can also send some interesting Metadata to InfluxDB alongside Performance Data.
This is everything we need to build an ITOA Version of the SLA Report, but we have to activate it (it is disabled by default). To activate it, edit the file /neteye/shared/icinga2/conf/icinga2/features-enabled/influxdb.conf: add the property enable_send_metadata and set it to true. Here is an example of how the file should look after editing.
/**
 * The InfluxdbWriter type writes check result metrics and
 * performance data to an InfluxDB HTTP API
 */
library "perfdata"
object InfluxdbWriter "influxdb" {
  host = "influxdb.neteyelocal"
  port = 8086
  ssl_enable = true
  username = "influxdbwriter"
  password = <PASSWORD>
  database = "icinga2"
  flush_threshold = 1024
  flush_interval = 10s
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      hostname = "$host.name$"
    }
  }
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
  enable_send_metadata = true
}
Then, restart the Icinga2 Master Service, and Icinga2 will store the required metadata alongside Performance Data within InfluxDB every time a Check is executed. The downside is that the disk space consumption and the cardinality of InfluxDB will increase, but this is nothing we can’t handle (usually).
To get more insights, look at the Icinga2 InfluxDB Writer Documentation.
Although data written to InfluxDB has the highest possible accuracy, InfluxDB and Grafana are tools for approximation, so some loss of accuracy is to be expected. Furthermore, there is no room for Event Correction: after data is sent to InfluxDB, it should be considered immutable by normal means, so updating data already stored should not be considered feasible.
While this can be considered a bit sad, please remember the original purpose. We are not trying to replace the Icingaweb2 Reporting Module or the NetEye SLM Module. We are trying to provide a tool that allows us to take action before our SLA is irreparably affected by the current State. The precise calculations are still in the domain of the NetEye Reporting Modules.
Now we can query InfluxDB for Availability data. For now, let’s stick to the Real State, without involving Acknowledgements or Downtimes.
The Status of a Host/Service is stored in the same Measurement used for its Performance Data, in the field named state. This field is numeric and contains the very same state returned by the Monitoring Plugin used (its Return Value), so you must remember that querying for Host Availability is slightly different from querying for Service Availability, as described in the Table below. In this Blog, we will use Service Availability.
State value | Host Status | Host Availability | Service Status | Service Availability |
---|---|---|---|---|
0 | UP | Available | OK | Available |
1 | DOWN | Unavailable | WARNING | Available |
2 | DOWN | Unavailable | CRITICAL | Unavailable |
3 | DOWN | Unavailable | UNKNOWN | UNKNOWN |
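To make the difference between Hosts and Services concrete, here is a minimal Python sketch of the mapping in the table above. The function and constant names are purely illustrative (they are not part of Icinga2, InfluxDB or NetEye), and treating every non-zero Host state as Unavailable is my reading of the table.

```python
# State values that count as "Available", following the table above.
# Names are hypothetical and only serve this illustration.
SERVICE_AVAILABLE_STATES = {0, 1}  # OK and WARNING
HOST_AVAILABLE_STATES = {0}        # UP only


def service_availability(state: int) -> str:
    """Translate a Service state value into its Availability."""
    if state in SERVICE_AVAILABLE_STATES:
        return "Available"
    if state == 2:  # CRITICAL
        return "Unavailable"
    return "UNKNOWN"  # state 3


def host_availability(state: int) -> str:
    """Translate a Host state value into its Availability.

    Assumption: any Host state other than 0 is treated as Unavailable.
    """
    return "Available" if state in HOST_AVAILABLE_STATES else "Unavailable"


# The same value means different things for a Host and for a Service.
print(host_availability(1), "/", service_availability(1))
```

The value 1 is the one to watch out for: a Service in WARNING is still Available, while a Host with the same value is not.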
To draw a Time Line, we can use the State timeline panel. The Query is simple: we need to select the field state from the right measurement. But what about the Aggregation Operator? The issue is shown in the next image.
Data returned from InfluxDB to Grafana is grouped by a specific Time Interval. Since for a normal Service the value for Check Interval is 3 minutes and Retry Interval is 1 minute, we should use a Time Interval of 1 minute or less, and missing points should be filled with the previous value to avoid gaps. The left part of the image represents this situation, but it is really optimistic. You should know that the Icinga2 Scheduler is not precise enough to guarantee a point every Check Interval/Retry Interval. Also, an Operator might click on the Check Now button, or a Passive Check might be triggered multiple times within a few seconds. This may result in multiple points within the same Time Interval, as represented in the right part of the image. Also, if a range of several days or weeks is selected in the Dashboard’s Time Browser, the Time Interval can easily grow to 1 hour or more. Therefore, we must decide how to handle multiple Points in the same Time Interval.
Since an Object State is an integer, we should not let InfluxDB return the mean value of all points in the Time Interval: which State corresponds to 1.8? Is it Critical or Warning? Rounding does not help either, regardless of the type of rounding used (floor or ceiling), because the result would be a State that might never have occurred. The only viable solution is to pick one value from the Time Interval and plot it. If we pick the Maximum value, we plot the Worst State that occurred, and if we pick the Minimum we plot the Best one. What should we do?
In my humble opinion, we should plot both: two separate Time Lines, one with the Best States and one with the Worst States. Our Real Availability is then something in the middle. As you can see in the image below, with a Time Window of 30 days and a Time Interval of 30 minutes, we had only one Major Outage, visible in both Time Lines, and several Minor Outages, visible only because we are looking for the worst states. This way we can easily spot the points where we want to zoom in and see more details, while knowing that, at this time resolution, everything has gone almost fine. Remember that this can also happen with narrower Time Windows: depending on how frequently the State of an Object changes, approximations will hide what truly happened. Now, you can easily say “Ok, let’s only plot the Worst Cases”; then, I can reply “Ok, if your Service changes state fast enough (or if you zoom out enough) you will only see a completely red bar”, which is completely useless.
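If you want to see the effect of the Aggregation Operator without touching InfluxDB, the following Python sketch mimics the grouping on a handful of made-up points. It is only an illustration of the idea: the helper names and the sample data are hypothetical, and the fill logic is a simplified imitation of what fill(previous) does.

```python
def bucketize(points, start, end, interval):
    """Group (timestamp, state) points into fixed-width Time Intervals."""
    buckets = [[] for _ in range((end - start) // interval)]
    for ts, state in points:
        if start <= ts < end:
            buckets[(ts - start) // interval].append(state)
    return buckets


def aggregate(buckets, func):
    """Apply func to each bucket; empty buckets repeat the previous value."""
    result, previous = [], None
    for states in buckets:
        if states:
            previous = func(states)
        result.append(previous)
    return result


def mean(states):
    return sum(states) / len(states)


# Hypothetical sample: timestamps in seconds, states as returned by the plugin.
points = [(0, 0), (60, 0), (120, 2), (180, 0), (240, 0), (300, 0), (480, 0)]
buckets = bucketize(points, start=0, end=900, interval=300)

worst = aggregate(buckets, max)     # Worst States Time Line
best = aggregate(buckets, min)      # Best States Time Line
average = aggregate(buckets, mean)  # not a valid State, shown for comparison

print("worst:  ", worst)    # [2, 0, 0]
print("best:   ", best)     # [0, 0, 0]
print("average:", average)  # first value is 0.4, which matches no State
```

The short outage at second 120 survives only in the Worst States series, while the average of the first Time Interval is a number that corresponds to no State at all.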
Calculation of the Availability Percentage is a slightly different matter. First of all, what is Availability? In its simplest form, it is the ratio of the time a Service is Available to the total amount of time. Since calculating the amount of time spent in each State within a time-based query is a bit too difficult, we should remove the concept of time itself and use grouping.
If you look at the image of the Time Line Approximation, you will surely notice that, in lines two and three, we have a single value for each Time Interval. Therefore, counting the number of times a specific value appears and dividing it by the count of all returned values will do the trick, without knowing anything about the size of the Time Window.
Since we have to display a single value, we don’t really care about the Time Interval Grafana proposes for the grouping: we can set it to a value of our liking. In this specific case, we used 1 minute. If you want a more accurate value, you can go even further down to 30 seconds or 10 seconds, but remember one important thing: the lower the Time Interval, the higher the number of points InfluxDB has to handle, resulting in poorer Dashboard Performance. So, keep in mind the typical Time Window you expect to use and test the resulting performance accordingly.
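To get a feeling for how quickly the number of points grows, here is a small back-of-the-envelope calculation in Python. The 30-day Time Window is just an example, and the function name is made up for this sketch.

```python
def points_in_window(window_seconds: int, interval_seconds: int) -> int:
    """Number of Time Intervals (one point each) in a Time Window."""
    return window_seconds // interval_seconds


THIRTY_DAYS = 30 * 24 * 3600  # example Time Window, in seconds

for interval in (60, 30, 10):
    count = points_in_window(THIRTY_DAYS, interval)
    print(f"{interval:>2}s interval -> {count} points")
```

Going from 1 minute to 10 seconds multiplies the number of points by six (from 43200 to 259200 for 30 days), and this applies to every query on the Dashboard.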
Then, let’s start by counting all values in the Time Window:
SELECT count("state")
FROM (
  SELECT max("state") AS "state"
  FROM "nx-c-businessprocess-process-check"
  WHERE
    ("hostname"::tag = 'HOST_NAME' AND "service"::tag = 'SERVICE_NAME')
    AND $timeFilter
  GROUP BY time(1m)
  fill(previous)
)
We have a Nested Query made up of two queries. The Inner Query is the same as the one used for the Pessimistic Timeline, with one small adjustment: the GROUP BY time clause changes from time($__interval) to time(1m), ensuring a 1-minute resolution for the state calculation. The Outer one is just a simple COUNT. To make COUNT work, we simply added an ALIAS for the single field returned by the Inner Query.
Now, let’s continue by counting how many Time Intervals have the Available state (which translates to values 0 and 1 for a Service):
SELECT count("state")
FROM (
  SELECT max("state") AS "state"
  FROM "nx-c-businessprocess-process-check"
  WHERE
    ("hostname"::tag = 'HOST_NAME' AND "service"::tag = 'SERVICE_NAME')
    AND $timeFilter
  GROUP BY time(1m)
  fill(previous)
)
WHERE "state" = 0 OR "state" = 1
Easy-peasy: just add a WHERE Clause that keeps only the values we are interested in.
To calculate the actual availability, we just need to use a Grafana Expression to compute the ratio between the two queries (Available count divided by total count), and that’s all. To display this number, you can use a Stat Panel, and remember to hide the two Queries, leaving only the Expression visible.
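If you want to double check the logic outside Grafana, the whole calculation can be sketched in a few lines of Python. This is only an illustration under some assumptions: the input list stands for the output of the Inner Query (one worst-case state per minute), the sample values are made up, buckets without a value are assumed not to be counted, and multiplying by 100 to get a percentage is my own choice.

```python
def availability_percent(bucket_states, available_states=(0, 1)):
    """Share of Time Intervals in an Available state, as a percentage.

    bucket_states holds one state per Time Interval; None marks an
    interval without a value, which is assumed not to be counted.
    """
    values = [s for s in bucket_states if s is not None]
    if not values:
        return None  # nothing to count in this Time Window
    available = sum(1 for s in values if s in available_states)
    return 100.0 * available / len(values)


# Hypothetical hour of data: 50 minutes OK, 6 CRITICAL, 4 WARNING.
per_minute = [0] * 50 + [2] * 6 + [1] * 4

print(availability_percent(per_minute))  # 90.0
```

The two counts in the function correspond to the two queries: len(values) is the total count, and available is the count filtered on states 0 and 1.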
Now, by assembling both the Timeline and the Stat Panels and adding a simple Stat to show the Service’s latest Status, we have an ITOA Dashboard with a neat layout that is easy to understand.
In this Dashboard we have the status of several Services, each of which is actually monitoring a Business Process, but you can use the same logic to calculate the Availability of a simple Service, if you need to do so. Using the Time Picker provided by Grafana, you can get a rough calculation of the Availability in a Time Window of your choice and, at the same time, see some points of interest that you can zoom into to get more insights.
In the future, we will integrate Acknowledgements and Downtimes into the logic of this Dashboard, and of course provide you with useful suggestions about how to do the same.
Happy new Year and SLA Reporting!