30. 11. 2024 Rocco Pezzani Business Service Monitoring, NetEye, Unified Monitoring

The Story of a Strange Business Process

Today I’d like to share a peculiar request I received during one of my recent Consulting Sessions. It’s just a highly specific Business Case, but it led me to wonder: what if I could find a way to generalize it? For now, that’s just a fantasy, but let me tell you the whole story so you can judge for yourself. Maybe, after I sort my ideas out a bit, we can get something really useful.

A Service to check all Services

One of our longest-term Customers recently posed a strange request to me:

– Customer: “Hi Rocco, can you help me create a script that rebuilds a Business Process?”
– Me: “Say what? Could you please elaborate a bit…”
– Customer: “I need a Business Process that goes CRITICAL if one (or more) of the services of a specific host becomes CRITICAL. The services are dynamically created by Tornado, and there are a lot of them. Also, there are a lot of hosts that need this kind of strategy. So, I need a script that can build/rebuild a Business Process: this script should be launched by Icinga as a Monitoring Plugin, and it’ll return the state of the Business Process.”
– Me: “Uh, okay… and how often should this script run? Every 5 minutes?”
– Customer: “Naaah, once per hour is fine.”

Because this customer knows NetEye really well, his thoughts went immediately to the Business Process module, since that’s exactly what it’s designed for. And in normal circumstances, that really would be the right answer. But… rebuild the Business Process every time we need to check its state? That’s overkill, even if it only has to be done once per hour.

First of all, a BP is a static object. I’m not against the idea of dynamically creating one, but having to do so every time means we’re probably just using the wrong tool. But honestly the idea of a BP that can fetch the required objects dynamically using search criteria (or something similar) is a pretty intriguing request. So, instead of just dismissing the idea, I started thinking about some alternatives.

Script specifications

Okay, let’s assume that we stick strictly to this customer’s request and create the script. What should this Script do? In sequential order:

1. Call the Icinga API and get the list of required Objects
2. Build a BP where all these Objects are tied together with an AND rule
3. Invoke icingacli businessprocess to check the state of this dynamically created BP
4. Return the state obtained from icingacli businessprocess
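
A minimal sketch of what steps 1 and 2 could look like, written in Python (the customer's request names no language, so that choice is mine). The function names, the BP name, and the host and service names are hypothetical. The query body follows the Icinga 2 API filter syntax and the output line follows the `host;service` node notation of the Business Process module, but treat both as assumptions to verify against your installation.

```python
def build_service_query(host_name):
    """Request body for POST /v1/objects/services on the Icinga 2 API.

    Assumption: the request is sent with the header
    'X-HTTP-Method-Override: GET', so the filter can travel in the body.
    """
    return {
        "filter": "host.name == hn",
        "filter_vars": {"hn": host_name},
        "attrs": ["name"],
    }


def build_bp_expression(bp_name, host_name, service_names):
    """Tie all services of one host together with an AND ('&') operator."""
    if not service_names:
        # An empty BP would hide a failed or incomplete API call.
        raise ValueError("refusing to build an empty Business Process")
    nodes = [f"{host_name};{name}" for name in sorted(service_names)]
    return f"{bp_name} = " + " & ".join(nodes)


if __name__ == "__main__":
    # Hypothetical host "web01" with two services.
    print(build_bp_expression("all_services_web01", "web01", ["http", "disk"]))
    # prints: all_services_web01 = web01;disk & web01;http
```

Steps 3 and 4 would then write this line into a BP configuration, run icingacli businessprocess against it, and pass its exit code through unchanged.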

Pretty easy. But the fact that this script has to call the Icinga API is the core issue. This introduces some potential problems:

- The Icinga2 API might be slow to respond
- The Icinga2 API call must be done against the all-knowing Master Host

Together, these mean the check can’t be delegated to Satellites: every run takes time to complete and all of them land on the Master instead of being distributed across more endpoints, which introduces latency and limits scalability. On top of that, there are some problems the script must handle:

- The Icinga2 API might not respond at all (because of a deployment, a restart, or a PCS resource relocation)
- The Icinga2 API might return the wrong result (incomplete or empty response, various HTTP errors, and so on)
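
To give an idea of the defensive code these failure modes require, here is a hedged Python sketch. `fetch` is a hypothetical zero-argument callable that performs the HTTP request and returns the raw response body. The response layout (`results[].attrs.name`) is what I expect from the Icinga 2 API when `name` is requested as an attribute. Mapping every failure to UNKNOWN is my own design choice, not a requirement from the customer.

```python
import json

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_service_names(raw_body):
    """Extract service names from an API response body.

    Raises ValueError on anything that is not a non-empty result list.
    """
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list) or not results:
        raise ValueError("response contains no results")
    try:
        return [item["attrs"]["name"] for item in results]
    except (KeyError, TypeError) as exc:
        raise ValueError("result entry without attrs.name") from exc


def safe_fetch(fetch):
    """Run the fetch callable and map every failure to UNKNOWN."""
    try:
        return OK, parse_service_names(fetch())
    except (OSError, ValueError) as exc:
        # OSError covers refused connections, timeouts and urllib's URLError.
        return UNKNOWN, f"UNKNOWN - cannot query Icinga 2 API: {exc}"
```

Note that an empty result list is treated as a failure here: a host without any service is indistinguishable from a truncated answer, and guessing wrong would turn a broken API call into a green check.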

This makes the idea of creating a Script that can be executed regularly much less appealing than imagined. In fact, it’s a complete headache: monitoring must be reliable, and this script might not meet that requirement.

The most logical alternative

The issue is: something invoked by Icinga2 (our script) needs to access data from Icinga2 itself in a reliable way, and all of this must happen when Icinga2 is not busy doing other things. So, the most logical idea is to not go outside of Icinga2 itself. In other words, Icinga2 already knows all about the monitored objects, so it should be the one to handle them, right? But how?

A colleague of mine, Patrick Zambelli, created in one of his blog posts a “smarter” version of the dummy command that can reset a Service Object to its current status (see Icinga 2 DSL for Defining the Monitoring Status of Objects with Director | www.neteye-blog.com). This version can get the current status of a Service Object and return it at the same time, so I drew on it for inspiration. The solution is therefore easy: I just have to use Icinga’s Domain Specific Language (Icinga DSL).

A solution with Icinga DSL

Icinga2 can perform various actions scripted in its own Domain Specific Language, or DSL for short. It was introduced to allow custom code execution directly inside an Icinga2 process. Note that not all kinds of code and operations can be executed. To see what can be done, look at Icinga DSL’s Language Reference and at the Library Reference. Or if you’re lazy, just be aware you can access all Objects in Icinga2 and write some log output.

You can add DSL-based code almost anywhere in Icinga2’s configuration files, but it shouldn’t be used as the main strategy: if there’s no valid and crucial reason, everything should instead be managed through Icinga Director, and Icinga Director only allows the use of Icinga DSL to calculate a Command’s Argument value. This limitation might seem a little frustrating, but it’s the right way to ensure things go smoothly.

Therefore, we can use DSL to create our dynamic BP and calculate its status, and then we can pass this result to the monitoring plugin dummy. This plugin’s load and execution time can be measured in milliseconds, so we can achieve both simplicity and scalability.

The DSL script

Now we can define the requirements of the Customer’s request:

- I have a Host with several Services
- On this Host I need a Service
- This Service must get the status of all services on that same Host excluding itself
- If at least one of those services is in the CRITICAL State, this service should return CRITICAL
- I must be able to include or exclude Acknowledged Services at will

To be more flexible, the DSL code should not be restricted to looking only for services in the CRITICAL state: the state to look for should be provided as an argument. The check itself still returns CRITICAL whenever at least one service is found in that state. So our script logic is:

1. Get the state to check
2. Get whether the acknowledgement matters
3. Get all services of the host
4. For each Service that is not the current service:
   - Get its Service State
   - Get whether it’s been acknowledged (if it matters)
   - Update the count of services in that state accordingly (i.e., acknowledged or not)
5. Get the number of services in the state to check (taking acknowledgement into account)
6. Return CRITICAL if this number is higher than 0

Here’s the relevant bit of code:

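// Read the context and the two custom variables set on the Service
host_name = macro("$host.name$")
service_name = macro("$service.name$")
skip_acknowledged = macro("$skip_acknowledged$")
state_to_check = macro("$state_to_check$")

services_list = get_services(host_name)
// One [not acknowledged, acknowledged] counter pair per state (0 to 3)
states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
for (service in services_list) {
	// Do not count the aggregating service itself
	if (service.name == service_name) {
		continue
	}
	state = service.state
	current_state = states_list[state]

	// acknowledgement is 0 (none), 1 (normal) or 2 (sticky): fold it into 0/1
	// so that a sticky acknowledgement cannot index past the counter pair
	acknowledged = 0
	if (service.acknowledgement > 0) {
		acknowledged = 1
	}
	current_state[acknowledged] += 1
}

state_count = states_list[state_to_check]
number_of_services = state_count[0] + state_count[1]
if (skip_acknowledged) {
	number_of_services = state_count[0]
}

// CRITICAL as soon as one matching service exists, OK otherwise
state = 0
if (number_of_services > 0) {
	state = 2
}

return state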
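
To sanity-check the counting logic outside of Icinga2, it can be modeled in a few lines of Python. This is only a model under my own assumptions: the services are plain (name, state, acknowledgement) tuples, the host and service names are made up, and the function names are hypothetical. Icinga 2 reports acknowledgement as 0 (none), 1 (normal) or 2 (sticky), so the model folds any non-zero value into the "acknowledged" column.

```python
def count_in_state(services, own_name, state_to_check, skip_acknowledged):
    """Count services in `state_to_check`, mirroring the DSL logic.

    `services` is a list of (name, state, acknowledgement) tuples.
    """
    # counts[state] = [not acknowledged, acknowledged]
    counts = [[0, 0] for _ in range(4)]
    for name, state, acknowledgement in services:
        if name == own_name:
            continue  # the aggregating service never counts itself
        acknowledged = 1 if acknowledgement > 0 else 0
        counts[state][acknowledged] += 1
    not_acked, acked = counts[state_to_check]
    return not_acked if skip_acknowledged else not_acked + acked


def resulting_state(number_of_services):
    """CRITICAL (2) as soon as one matching service exists, OK (0) otherwise."""
    return 2 if number_of_services > 0 else 0


# Hypothetical host: one unacknowledged and one sticky-acknowledged CRITICAL.
services = [
    ("all-services", 0, 0),
    ("disk", 2, 0),
    ("http", 2, 2),
    ("ping", 0, 0),
]
print(count_in_state(services, "all-services", 2, False))  # 2
print(count_in_state(services, "all-services", 2, True))   # 1
```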

Similarly, we can write some code that prepares a human-readable Plugin output:

// Same counting logic as above, but returning the Plugin output text
host_name = macro("$host.name$")
service_name = macro("$service.name$")
skip_acknowledged = macro("$skip_acknowledged$")
state_to_check = macro("$state_to_check$")

services_list = get_services(host_name)
// One [not acknowledged, acknowledged] counter pair per state (0 to 3)
states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
for (service in services_list) {
	// Do not count the aggregating service itself
	if (service.name == service_name) {
		continue
	}

	state = service.state
	current_state = states_list[state]

	// acknowledgement is 0 (none), 1 (normal) or 2 (sticky): fold it into 0/1
	acknowledged = 0
	if (service.acknowledgement > 0) {
		acknowledged = 1
	}
	current_state[acknowledged] += 1
}

state_count = states_list[state_to_check]
number_of_services = state_count[0] + state_count[1]
if (skip_acknowledged) {
	number_of_services = state_count[0]
}

message = "No services found with the selected state"
if (number_of_services > 0) {
	message = "Found " + number_of_services + " service(s) having the selected state"
}

return message

The Command Object

Below is the complete code for the Command Object. If you create it and have the Monitoring Plugin check_dummy in the right Path, you can immediately use it by creating a Service Template and two Data Fields, one for skip_acknowledged (which should be a Boolean) and one for state_to_check (which is an integer: 0 for OK/UP, 1 for WARNING/DOWN, 2 for CRITICAL and 3 for UNKNOWN).

And to confirm scalability, its actual execution time should really be less than 10ms for getting the status of around 2100 Service Objects. Pretty quick.

Now, this is not quite ready to be included inside the NetEye Extension Pack (NEP) Project, but if I can find a good balance between complexity and customizability, it might appear sooner than expected. Stay tuned!


object CheckCommand "count-services-in-state" {
    import "plugin-check-command"
    command = [ "/neteye/shared/monitoring/plugins/check_dummy" ]
    timeout = 1m
    arguments += {
        // First positional argument of check_dummy: the state to return
        dummy_state = {
            order = 0
            required = true
            skip_key = true
            value = {{
                host_name = macro("$host.name$")
                service_name = macro("$service.name$")
                skip_acknowledged = macro("$skip_acknowledged$")
                state_to_check = macro("$state_to_check$")

                services_list = get_services(host_name)
                // One [not acknowledged, acknowledged] counter pair per state (0 to 3)
                states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
                for (service in services_list) {
                    // Do not count the aggregating service itself
                    if (service.name == service_name) {
                        continue
                    }
                    state = service.state
                    current_state = states_list[state]

                    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky): fold it into 0/1
                    acknowledged = 0
                    if (service.acknowledgement > 0) {
                        acknowledged = 1
                    }
                    current_state[acknowledged] += 1
                }

                state_count = states_list[state_to_check]
                number_of_services = state_count[0] + state_count[1]
                if (skip_acknowledged) {
                    number_of_services = state_count[0]
                }

                state = 0
                if (number_of_services > 0) {
                    state = 2
                }

                return state
            }}
        }
        // Second positional argument of check_dummy: the output text
        dummy_text = {
            order = 1
            required = true
            skip_key = true
            value = {{
                host_name = macro("$host.name$")
                service_name = macro("$service.name$")
                skip_acknowledged = macro("$skip_acknowledged$")
                state_to_check = macro("$state_to_check$")

                services_list = get_services(host_name)
                states_list = [[0, 0], [0, 0], [0, 0], [0, 0]]
                for (service in services_list) {
                    if (service.name == service_name) {
                        continue
                    }

                    state = service.state
                    current_state = states_list[state]

                    // acknowledgement is 0 (none), 1 (normal) or 2 (sticky): fold it into 0/1
                    acknowledged = 0
                    if (service.acknowledgement > 0) {
                        acknowledged = 1
                    }
                    current_state[acknowledged] += 1
                }

                state_count = states_list[state_to_check]
                number_of_services = state_count[0] + state_count[1]
                if (skip_acknowledged) {
                    number_of_services = state_count[0]
                }

                message = "No services found with the selected state"
                if (number_of_services > 0) {
                    message = "Found " + number_of_services + " service(s) having the selected state"
                }

                return message
            }}
        }
    }
}
