In IT monitoring, we often deal with metrics that are cumulative, or whose definition we cannot control or modify. Examples include the various Windows performance monitoring metrics for the MS Dynamics AX 2012 R3 Application Object Server (AOS). These metrics are cumulative, but whenever the AOS services are restarted, or a value grows too large and overflows, the counters are reset and start again from 0.
In our use case, we would like to store not the cumulative value, which doesn’t provide much information by itself, but instead the magnitude of the increment between two measurements (the diffs). We want to be able to compare the values with historical data, detect anomalies, and be able to ignore resets and overflows of these counters. So for our objective, a diff metric is more suitable than a cumulative one.
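To make the idea concrete, here is a small Python sketch (not Telegraf code) that turns a series of cumulative readings into diffs and treats any negative step as a reset or overflow. The function name `to_diffs` and the sample values are hypothetical, chosen only for illustration.

```python
def to_diffs(readings):
    """Convert cumulative counter readings into non-negative increments.

    Hypothetical helper for illustration only. A negative step is assumed
    to mean the counter was reset or overflowed, so it is reported as 0.
    """
    diffs = []
    for previous, current in zip(readings, readings[1:]):
        step = current - previous
        diffs.append(step if step >= 0 else 0)
    return diffs


# Made-up readings: the counter restarts from 0 between 350 and 40.
cumulative = [100, 250, 350, 40, 90]
print(to_diffs(cumulative))  # [150, 100, 0, 50]
```

Clamping to 0 discards whatever increments happened between the last reading before the reset and the first reading after it, which is the trade-off accepted in this approach.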
Although we could compute the diffs when retrieving the data for visualization, it’s often inefficient and unnecessary; it would be preferable to move the computation as close as possible to the edge where data collection occurs and store only the already-computed diffs. This will allow us to optimize and distribute (it’s common to have several AOS services running at the same time) computational costs and boost performance, allowing us to visualize larger data sets and thus longer time periods.
Let’s start by considering a simple Telegraf implementation that reads cumulative performance monitor metrics from an AX server and writes to stdout:
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    ObjectName = "Microsoft Dynamics AX Object Server"
    Counters = ["TOTAL NUMBER OF HITS", "NUMBER OF BYTES SENT BY SERVER", "NUMBER OF CLIENT REQUESTS"]
    Instances = ["*"]
    Measurement = "Microsoft_Dynamics_AX_Object_Server"
    IncludeTotal = true
    [inputs.win_perf_counters.object.tags]
      tag1 = "foo"

[[outputs.file]]
  files = ["stdout"]
To compute the diffs before writing them out, we’ll use Starlark, a bare-bones Python dialect well suited to short, simple scripts. Telegraf provides a Starlark processor whose API closely follows the structure of a Telegraf metric, which makes it easy to use.
Each measurement is passed to the Starlark interpreter and converted into an instance of the Starlark Metric class. The Metric class exposes four properties: metric.name, metric.fields, metric.tags, and metric.time, automatically populated from the metric sent by Telegraf.
In our case a Metric instance might look like this:
metric.name: "Microsoft_Dynamics_AX_Object_Server"
metric.tags: {"tag1": "foo"}
metric.fields: {"NUMBER_OF_CLIENT_REQUESTS": 100, "TOTAL_NUMBER_OF_HITS": 200, ...}
metric.time: 1629294886000000000  # in nanoseconds
We want to compute the diffs by tag for each measurement we consider. If the diff between two consecutive metrics for the same tags is negative, it means that either the counter has overflowed, or else the counter has been reset. In this case we can just set the diff to zero. Here’s the code (I’ll explain the details below):
[[processors.starlark]]
  namepass = ["Microsoft_Dynamics_AX_Object_Server"]
  source = '''
# state is the only global object that can be modified. Other objects defined
# before the apply function will be frozen.
state = {}

# fields to diff, per measurement name
fields_dict = {
    "Microsoft_Dynamics_AX_Object_Server": [
        "NUMBER_OF_CLIENT_REQUESTS",
        "NUMBER_OF_BYTES_SENT_BY_SERVER",
        "TOTAL_NUMBER_OF_HITS",
    ],
}

def apply(metric):
    # key identifying one series: the sorted tags plus the metric name
    metric_hash = str(sorted(metric.tags.items())) + metric.name
    # the first time we see metric_hash, last will be None
    last = state.get(metric_hash)
    # cache a deep copy of the current metric as the new "last" for this
    # series; the copy is taken before the _diff fields are added below
    state[metric_hash] = deepcopy(metric)

    # add the new fields, default to 0
    for f in fields_dict[metric.name]:
        metric.fields[f + "_diff"] = 0

    # compute diff only if there is a previous metric to compute diff with
    if last != None:
        for f in fields_dict[metric.name]:
            diff = metric.fields[f] - last.fields[f]
            # if there has been a reset or an overflow diff will be negative,
            # in this case we want to set it to zero.
            metric.fields[f + "_diff"] = diff if diff >= 0 else 0
    return metric
'''
Since we want to compute the diffs by tag, we need to keep the last metric we received for each possible metric and tag combination. We store it in the only available globally editable dictionary, state, using as the key a poor-man’s hash (metric_hash), built by simply sorting the tags and concatenating them with the metric name.
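The key construction can be tried out in plain Python. This is only a sketch: `series_key` is a hypothetical stand-in for the metric_hash expression, tags are modelled as an ordinary dict, and the `instance` tag values are made up.

```python
def series_key(name, tags):
    """Build a per-series key from the metric name and its tags.

    Hypothetical stand-in for the metric_hash expression in the processor;
    `tags` is a plain dict here rather than a Telegraf tag set.
    """
    return str(sorted(tags.items())) + name


a = series_key("Microsoft_Dynamics_AX_Object_Server", {"instance": "01", "tag1": "foo"})
b = series_key("Microsoft_Dynamics_AX_Object_Server", {"tag1": "foo", "instance": "01"})
c = series_key("Microsoft_Dynamics_AX_Object_Server", {"instance": "02", "tag1": "foo"})

print(a == b)  # True: tag order does not matter
print(a != c)  # True: a different tag value gives a different series
```

Sorting the tag pairs before turning them into a string is what makes the key independent of the order in which the tags arrive.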
The first time we see a metric, the lookup in state returns None, so the diff fields keep their default value of 0. The diff is computed only when the next metric with the same metric_hash is delivered.
Since this processor could be used for other measurements too, we also define another dict, fields_dict, to decide which fields to process for each measurement that is passed to the processor through the namepass parameter. For all the fields defined in fields_dict we add a new field with the same field name and the suffix _diff.
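To check the logic without running Telegraf, the processor can be imitated in plain Python. This is a rough sketch under stated assumptions: metrics are modelled as ordinary dicts, and the measurement name `ax_server`, the field names `hits` and `requests`, and the `sample` helper are all hypothetical.

```python
import copy

# Hypothetical field list, mirroring the shape of fields_dict in the processor.
FIELDS = {"ax_server": ["hits", "requests"]}

state = {}  # last metric seen per series key


def apply(metric):
    """Plain-Python imitation of the processor logic on dict-based metrics."""
    key = str(sorted(metric["tags"].items())) + metric["name"]
    last = state.get(key)
    state[key] = copy.deepcopy(metric)  # cached before the _diff fields are added

    for field in FIELDS[metric["name"]]:
        diff = 0  # default for the first sample of a series
        if last is not None:
            # a negative step (reset or overflow) is clamped to 0
            diff = max(metric["fields"][field] - last["fields"][field], 0)
        metric["fields"][field + "_diff"] = diff
    return metric


def sample(hits, requests):
    """Build a made-up metric with cumulative counter values."""
    return {"name": "ax_server", "tags": {"tag1": "foo"},
            "fields": {"hits": hits, "requests": requests}}


# Three consecutive samples; the third one simulates a counter reset.
for m in (sample(200, 100), sample(260, 130), sample(15, 5)):
    print(apply(m)["fields"])
```

The first sample gets diffs of 0 because nothing is cached yet, the second gets the real increments (60 and 30), and the third gets 0 again because both counters went backwards.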
In this article I showed how to deal with strangely behaving cumulative metrics coming from Windows perfmon counters: with a simple processor and only some very basic knowledge of Python, you can extract a diff metric without writing a custom plugin. Starlark is a powerful, quick and easy tool for this kind of simple data pre-processing. In a future blog post, I’ll show you other, more advanced uses of the Starlark processor.