Today I’d like to share a peculiar request I received during one of my recent Consulting Sessions. It’s just a highly specific Business Case, but it led me to wonder: what if I could find a way to generalize it? For now, that’s just a fantasy, but let me tell you the whole story so you can judge for yourself. Maybe, after I sort my ideas out a bit, we can get something really useful.
One of our longest-term Customers recently posed a strange request to me:
– Customer: “Hi Rocco, can you help me create a script that rebuilds a Business Process?”
– Me: “Say what? Could you please elaborate a bit…”
– Customer: “I need a Business Process that goes CRITICAL if one (or more) of the services of a specific host becomes CRITICAL. The services are dynamically created by Tornado, and there are a lot of them. Also, there are a lot of hosts that need this kind of strategy. So, I need a script that can build/rebuild a Business Process: this script should be launched by Icinga as a Monitoring Plugin, and it’ll return the state of the Business Process.”
– Me: “Uh, okay… and how often should this script run? Every 5 minutes?”
– Customer: “Naaah, once per hour is fine.”
Because this customer knows NetEye really well, his thoughts went immediately to the Business Process module, since this is exactly the kind of problem it was designed for. And in normal circumstances, that really would be the right answer. But… rebuild the Business Process every time we need to check its state? That’s overkill for a Business Process, even if it only has to be done once per hour.
First of all, a BP is a static object. I’m not against the idea of dynamically creating one, but having to do so every time means we’re probably just using the wrong tool. But honestly the idea of a BP that can fetch the required objects dynamically using search criteria (or something similar) is a pretty intriguing request. So, instead of just dismissing the idea, I started thinking about some alternatives.
Okay, let’s assume that we stick strictly to this customer’s request and create the script. What should this Script do? In sequential order:
Run icingacli businessprocess to check the state of this dynamically created BP
Pretty easy. But the fact that this script has to call the Icinga API is the core issue. This introduces some potential problems:
These considerations mean the call can’t be delegated to Satellites, and can introduce latency and scalability issues because it requires time to complete and cannot be distributed across more endpoints. Also, there are some problems the script must handle:
This makes the idea of creating a Script that can be executed regularly much less appealing than imagined. In fact, it’s a complete headache: monitoring must be reliable, and this script might not meet that requirement.
The issue is: something invoked by Icinga2 (our script) needs to access data from Icinga2 itself in a reliable way, and all of this must happen when Icinga2 is not busy doing other things. So, the most logical idea is to not go outside of Icinga2 itself. In other words, Icinga2 already knows all about the monitored objects, so it should be the one to handle them, right? But how?
A colleague of mine, Patrick Zambelli, created a “smarter” version of the dummy command in one of his blog posts: it can reset a Service Object to its current status (see Icinga 2 DSL for Defining the Monitoring Status of Objects with Director | www.neteye-blog.com). This version reads the current status of a Service Object and returns it in the same step, so I drew on it for inspiration. Therefore, the solution is easy: I just have to use Icinga’s Domain Specific Language (Icinga DSL).
Icinga2 can run custom logic written in its own configuration language, the Icinga Domain Specific Language (DSL for short). It was introduced to allow custom code execution directly inside the Icinga2 process. Note that not every kind of code or operation can be executed. To see what can be done, look at the Icinga DSL’s Language Reference and Library Reference. Or, if you’re lazy, just be aware that you can access all Objects in Icinga2 and write some log output.
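To make “access all Objects and write some log output” a little more concrete, here is a minimal sketch that could be tried from an icinga2 console session connected to a running instance. The host name example-host and the log facility dsl-sketch are placeholders I made up for illustration; get_services, filter, len and log are taken from the Icinga DSL Library Reference.

```icinga2
// Hypothetical host name: replace it with a host that exists in your setup.
var host_name = "example-host"

// Fetch every Service Object attached to that host.
var services = get_services(host_name)

// Keep only the services whose current state is CRITICAL (2).
var critical = services.filter((s) => s.state == 2)

// Write a summary line to the Icinga2 log.
log(LogInformation, "dsl-sketch", host_name + ": " + critical.len() + " of " + services.len() + " services are CRITICAL")
```

Nothing here leaves the Icinga2 process: no API call, no credentials, no external script.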
You can add DSL-based code almost anywhere in Icinga2’s configuration files, but it shouldn’t be used as the main strategy: if there’s no valid and crucial reason, everything should instead be managed through Icinga Director, and Icinga Director only allows the use of Icinga DSL to calculate a Command’s Argument value. This limitation might seem a little frustrating, but it’s the right way to ensure things go smoothly.
Therefore, we can use DSL to create our dynamic BP and calculate its status, and then we can pass this result to the monitoring plugin dummy. This plugin’s load and execution time can be measured in milliseconds, so we can achieve both simplicity and scalability.
Now we can define the requirements of the Customer’s request:
To be more flexible, the DSL code should not be restricted to returning only the CRITICAL state, but rather that state should be provided as an argument. This becomes the state returned to the check. So our script logic is:
Here’s the relevant bit of code:
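// Resolve the runtime macros of the service that runs this check
host_name = macro("$host.name$")
service_name = macro("$service.name$")
skip_acknowledged = macro("$skip_acknowledged$")
state_to_check = macro("$state_to_check$")

services_list = get_services(host_name)

// One [not acknowledged, acknowledged] counter pair per state (0 to 3)
states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]

for (service in services_list) {
    // Skip the service that is executing this check
    if (service.name == service_name) {
        continue
    }
    state = service.state
    current_state = states_list[state]
    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky):
    // count both acknowledged kinds in slot 1
    acknowledged = 0
    if (service.acknowledgement != 0) {
        acknowledged = 1
    }
    current_state[acknowledged] += 1
}

state_count = states_list[state_to_check]
number_of_services = state_count[0] + state_count[1]
if (skip_acknowledged) {
    number_of_services = state_count[0]
}

// CRITICAL as soon as at least one service is in the selected state
state = 0
if (number_of_services > 0) {
    state = 2
}
return state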
Similarly, we can write some code that can prepare understandable Plugin output, just to make things easier to understand:
// Resolve the runtime macros of the service that runs this check
host_name = macro("$host.name$")
service_name = macro("$service.name$")
skip_acknowledged = macro("$skip_acknowledged$")
state_to_check = macro("$state_to_check$")

services_list = get_services(host_name)

// One [not acknowledged, acknowledged] counter pair per state (0 to 3)
states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]

for (service in services_list) {
    // Skip the service that is executing this check
    if (service.name == service_name) {
        continue
    }
    state = service.state
    current_state = states_list[state]
    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky):
    // count both acknowledged kinds in slot 1
    acknowledged = 0
    if (service.acknowledgement != 0) {
        acknowledged = 1
    }
    current_state[acknowledged] += 1
}

state_count = states_list[state_to_check]
number_of_services = state_count[0] + state_count[1]
if (skip_acknowledged) {
    number_of_services = state_count[0]
}

message = "No services found with the selected state"
if (number_of_services > 0) {
    message = "Found " + number_of_services + " service(s) having the selected state"
}
return message
Below is the complete code for the Command Object. If you create it and have the Monitoring Plugin check_dummy in the right Path, you can immediately use it by creating a Service Template and two Data Fields, one for skip_acknowledged (which should be a Boolean) and one for state_to_check (which is an integer: 0 for OK/UP, 1 for WARNING/DOWN, 2 for CRITICAL and 3 for UNKNOWN).
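For readers who want to see roughly what such a Service looks like as plain Icinga2 configuration, here is a minimal sketch. It is not what Director generates verbatim: the Service name services-in-critical and the host custom variable summarize_services are hypothetical, chosen only for illustration. The one-hour interval comes from the Customer’s request.

```icinga2
// Hypothetical apply rule: one summary Service per opted-in host.
apply Service "services-in-critical" {
  check_command = "count-services-in-state"

  // The Customer asked for one evaluation per hour.
  check_interval = 1h
  retry_interval = 1h

  // The two Data Fields described above.
  vars.skip_acknowledged = true   // ignore services whose problem is acknowledged
  vars.state_to_check = 2         // look for services in CRITICAL

  // Hypothetical host custom variable used to opt hosts in.
  assign where host.vars.summarize_services == true
}
```

Because the check only reads data that Icinga2 already holds in memory, the interval could be shortened considerably without adding meaningful load.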
And to confirm scalability, its actual execution time should really be less than 10ms for getting the status of around 2100 Service Objects. Pretty quick.
Now, this is not quite ready to be included inside the NetEye Extension Pack (NEP) Project, but if I can find a good balance between complexity and customizability, it might appear sooner than expected. Stay tuned!
object CheckCommand "count-services-in-state" {
    import "plugin-check-command"
    command = [ "/neteye/shared/monitoring/plugins/check_dummy" ]
    timeout = 1m
    arguments += {
        // First positional argument of check_dummy: the state to return
        dummy_state = {
            order = 0
            required = true
            skip_key = true
            value = {{
                host_name = macro("$host.name$")
                service_name = macro("$service.name$")
                skip_acknowledged = macro("$skip_acknowledged$")
                state_to_check = macro("$state_to_check$")
                services_list = get_services(host_name)
                // One [not acknowledged, acknowledged] counter pair per state (0 to 3)
                states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
                for (service in services_list) {
                    // Skip the service that is executing this check
                    if (service.name == service_name) {
                        continue
                    }
                    state = service.state
                    current_state = states_list[state]
                    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky):
                    // count both acknowledged kinds in slot 1
                    acknowledged = 0
                    if (service.acknowledgement != 0) {
                        acknowledged = 1
                    }
                    current_state[acknowledged] += 1
                }
                state_count = states_list[state_to_check]
                number_of_services = state_count[0] + state_count[1]
                if (skip_acknowledged) {
                    number_of_services = state_count[0]
                }
                state = 0
                if (number_of_services > 0) {
                    state = 2
                }
                return state
            }}
        }
        // Second positional argument of check_dummy: the plugin output text
        dummy_text = {
            order = 1
            required = true
            skip_key = true
            value = {{
                host_name = macro("$host.name$")
                service_name = macro("$service.name$")
                skip_acknowledged = macro("$skip_acknowledged$")
                state_to_check = macro("$state_to_check$")
                services_list = get_services(host_name)
                states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
                for (service in services_list) {
                    if (service.name == service_name) {
                        continue
                    }
                    state = service.state
                    current_state = states_list[state]
                    // Same normalization as above: sticky (2) counts as acknowledged
                    acknowledged = 0
                    if (service.acknowledgement != 0) {
                        acknowledged = 1
                    }
                    current_state[acknowledged] += 1
                }
                state_count = states_list[state_to_check]
                number_of_services = state_count[0] + state_count[1]
                if (skip_acknowledged) {
                    number_of_services = state_count[0]
                }
                message = "No services found with the selected state"
                if (number_of_services > 0) {
                    message = "Found " + number_of_services + " service(s) having the selected state"
                }
                return message
            }}
        }
    }
}