NetEye relies on many agents to monitor even a single server: Icinga, Telegraf, Elastic Beats, the GLPI agent, and so on.
As a Site Reliability Engineer, I’m responsible for ensuring that all these agents run smoothly. This can involve performing repetitive and time-consuming tasks like managing configurations, deploying updates, and provisioning new resources.
Ansible is an open-source automation tool that can automate tasks across entire IT environments, from servers and workstations to network devices and cloud services. It uses a simple, agentless approach that eliminates many of the complexities and headaches associated with traditional automation tools.
Another advantage of using Ansible is the ability to create predefined playbooks, letting you easily provision new servers, or update existing ones with a single command. This can help you save time and reduce the risk of human error, as well as ensure that your infrastructure is always up-to-date and secure.
For this reason, NetEye itself is installed, updated and upgraded with Ansible playbooks created by our R&D Team.
Suppose I’ve deployed new Fedora servers to scale our business application, and now the team has asked me to monitor the resources of these machines using Telegraf.
Say we have 3 Fedora servers named server1, server2, server3, a satellite named satellite.neteye that has SSH access to those servers, and I need to collect only basic metrics like disk IO, network, etc.
Telegraf comes with multiple plugins that can be used to monitor your server. By default we use the NATS output plugin to send the metrics to the satellite. Your satellite already has the infrastructure in place to receive and forward Telegraf metrics.
You only need to copy the user certificates from /neteye/local/telegraf/conf/certs/telegraf-agent.crt.pem and /neteye/local/telegraf/conf/certs/private/telegraf-agent.key.pem from your satellite to your target server, and set the Telegraf configuration that will look like this:
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

[[outputs.nats]]
  servers = ["nats://satellite.neteye:4222"]
  subject = "telegraf.metrics"
  secure = true
  tls_cert = "/etc/telegraf/certs/telegraf-agent.crt.pem"
  tls_key = "/etc/telegraf/certs/private/telegraf-agent.key.pem"
  data_format = "influx"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
  mount_points = ["/"]
  tagexclude = ["fstype", "device"]

[[inputs.mem]]
  tagexclude = ["mode"]

[[inputs.net]]
  interfaces = ["eth0"]
  tagexclude = ["interface"]

[[inputs.system]]
  tagexclude = ["host", "kernel", "uptime"]
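Let’s save this configuration as telegraf.conf in a working folder (e.g. /root/ansible-telegraf) together with the certificates above. Keep the certificates under certs/ and certs/private/ inside that folder, so that its layout mirrors the paths used by tls_cert and tls_key in the configuration.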
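Staging the certificates can itself be done with Ansible. The following is only a sketch, assuming it is run on the satellite against localhost and that /root/ansible-telegraf is used as the working folder; the play name, task names and file modes are illustrative choices, not part of NetEye.

```yaml
---
# Sketch: stage the agent certificates on the satellite itself.
# The certs/ and certs/private/ layout mirrors the tls_cert and tls_key
# paths referenced in telegraf.conf.
- name: Stage Telegraf agent certificates
  hosts: localhost
  connection: local
  become: true
  tasks:
    # state: directory also creates the intermediate folders
    - name: Create the staging directories
      ansible.builtin.file:
        path: /root/ansible-telegraf/certs/private
        state: directory
        mode: "0700"

    # remote_src: true reads the source from the managed host (here, the satellite)
    - name: Copy the agent certificate
      ansible.builtin.copy:
        src: /neteye/local/telegraf/conf/certs/telegraf-agent.crt.pem
        dest: /root/ansible-telegraf/certs/telegraf-agent.crt.pem
        remote_src: true
        mode: "0644"

    - name: Copy the agent private key
      ansible.builtin.copy:
        src: /neteye/local/telegraf/conf/certs/private/telegraf-agent.key.pem
        dest: /root/ansible-telegraf/certs/private/telegraf-agent.key.pem
        remote_src: true
        mode: "0600"
```

The telegraf.conf file still has to be placed in /root/ansible-telegraf by hand, or with one more copy task.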
As this blog post suggests, we want to automate the Telegraf deployment. In order to do so we need an Ansible playbook.
The playbook is the core of the Ansible solution, where you define the necessary steps to be executed. I won’t go into details here, but it’s mainly divided into two parts: the hosts that the play targets, and the tasks to run on them.
For our use case, the playbook will be something like this:
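---
- name: Install and configure Telegraf
  hosts: server1, server2, server3
  become: true
  tasks:
    # Fedora uses dnf; a repository providing the telegraf package is assumed to be configured
    - name: Install Telegraf
      dnf:
        name: telegraf
        state: present

    # The trailing slash on src copies the folder's contents rather than the folder itself
    - name: Configure Telegraf
      copy:
        src: /root/ansible-telegraf/
        dest: /etc/telegraf/
      notify: restart telegraf

  handlers:
    # Runs only when the copy task reports a change
    - name: restart telegraf
      systemd:
        name: telegraf
        state: restarted
        enabled: true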
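On each host (server1, server2, server3) we will install the Telegraf package, copy the configuration and certificates to /etc/telegraf, and, whenever those files change, restart the Telegraf service and enable it at boot.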
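The single copy task in the playbook pushes the whole folder with default ownership and permissions. A more defensive variant, sketched below, copies each file separately so that the private key is readable only by the service user. It assumes that the staging folder mirrors the certs/ and certs/private/ layout used in the configuration, and that the package creates a telegraf user and group; the modes are illustrative.

```yaml
# Sketch: tasks that could replace the single "Configure Telegraf" task.
# Assumes the telegraf package created a "telegraf" user and group.
- name: Create the certificate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: telegraf
    group: telegraf
    mode: "0750"
  loop:
    - /etc/telegraf/certs
    - /etc/telegraf/certs/private

- name: Copy the agent certificate
  ansible.builtin.copy:
    src: /root/ansible-telegraf/certs/telegraf-agent.crt.pem
    dest: /etc/telegraf/certs/telegraf-agent.crt.pem
    owner: telegraf
    group: telegraf
    mode: "0644"
  notify: restart telegraf

# The private key is readable only by the telegraf user
- name: Copy the agent private key
  ansible.builtin.copy:
    src: /root/ansible-telegraf/certs/private/telegraf-agent.key.pem
    dest: /etc/telegraf/certs/private/telegraf-agent.key.pem
    owner: telegraf
    group: telegraf
    mode: "0600"
  notify: restart telegraf

- name: Copy the Telegraf configuration
  ansible.builtin.copy:
    src: /root/ansible-telegraf/telegraf.conf
    dest: /etc/telegraf/telegraf.conf
    owner: root
    group: root
    mode: "0644"
  notify: restart telegraf
```

Each task notifies the same handler, so Telegraf is restarted at most once per run, and only if at least one file actually changed.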
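The playbook can be executed from the satellite with the command ansible-playbook -i server1,server2,server3 telegraf.yml, where -i takes an inline, comma-separated list of hosts. Note that -i localhost would not work here, because the inventory has to contain the hosts that the play targets. Alternatively, -i can point to a dynamic inventory generated, for example, by an inventory tool like GLPI.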
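Between an inline host list and a fully dynamic inventory there is also the option of a static inventory file. The following is a minimal sketch in Ansible's YAML inventory format; the file name inventory.yml, the group name telegraf_agents and the ansible_user value are hypothetical and only serve as an illustration.

```yaml
# inventory.yml: a hypothetical static inventory for the three servers.
# The group name and ansible_user are illustrative assumptions.
all:
  children:
    telegraf_agents:
      hosts:
        server1:
        server2:
        server3:
      vars:
        # User the satellite connects as over SSH
        ansible_user: root
```

It would be passed with ansible-playbook -i inventory.yml telegraf.yml. With a group in place, the hosts line of the play could target telegraf_agents instead of listing each server, so adding a fourth server only requires touching the inventory.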
By using Ansible to automate the deployment of agents like Telegraf, IT operations engineers can save significant amounts of time and effort while ensuring that all servers are monitored in a consistent and reliable manner.
With the ability to create predefined playbooks and easily provision new servers or update existing ones with a single command, Ansible can help reduce human error, and ensure that infrastructure is always up-to-date, secure and consistent.
So, it’s time to start using Ansible and reap the benefits of automation!