When I started my new role as a DevOps Engineer at the beginning of 2022, we had little DevOps experience as a team. We had tried several times to introduce automation in order to simplify maintenance and reduce the overall effort spent on routine activities. However, since DevOps engineers are also part of the development team, their efforts are often redirected from internal infrastructure to NetEye improvements or new NetEye features.
NetEye is a full Linux distribution, requiring a plethora of tools to be deployed: CI to test and build packages, pipelines to build Docker images for development, repositories to host ISOs and RPMs, the user guide, and so on.
As a result we’ve built up a huge amount of infrastructure, often implemented by different people at different times, and usually without enough time to do it the best way.
Sometimes we have Ansible playbooks that do part of the work and can be customized for specific tasks on specific machines. In most cases the documentation is outdated, and we rely heavily on the expertise of a few colleagues who know how to solve the issues that arise from time to time.
So we had a complex infrastructure that nobody was really confident enough to touch, for fear of breaking something and impacting either a customer, the entire development team, or both.
Finally, we made a decision: create a dedicated DevOps team.
Our first task was to change all the infrastructure to support the upgrade from CentOS 7 to RHEL 8. This forced us to go through all our infrastructure and really understand how it worked. Everything was impacted by the upgrade: CI, ISOs, RPMs, containers, virtual machines, and even the user guide, since our documentation is released as packages. It was a huge effort, but also a great opportunity to really go in depth on our infrastructure and to rewrite or clean up some parts.
It was immediately evident that continual effort was needed just to keep everything up and running: our time went into routine activities and emergency handling instead of improving the infrastructure.
Therefore we introduced reserved weekly time slots to do periodic maintenance. It may seem like a trivial task, but just updating and rebooting the machines highlights many issues and forces us to solve them, helping us to define maintenance procedures and, more importantly, to minimize manual steps for each machine.
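To give an idea, a minimal sketch of such a maintenance playbook for a RHEL-based machine might look like this (the file name and timeout are assumptions, not our actual setup):

```yaml
# update-and-reboot.yml -- hypothetical maintenance playbook
---
- name: Periodic maintenance, update packages and reboot
  hosts: all
  become: true
  tasks:
    - name: Update all packages to the latest version
      ansible.builtin.dnf:
        name: "*"
        state: latest

    - name: Reboot the machine and wait for it to come back
      ansible.builtin.reboot:
        reboot_timeout: 600   # seconds to wait before giving up
```

Running the same playbook on every host during the weekly slot makes issues surface immediately instead of letting them pile up.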
Furthermore, every emergency was documented in a troubleshooting guide, both to capture the common cases and to allow people outside the team to handle them.
After this initial experience we set forth some rules for any new machine we had to deploy, and all of them can be satisfied in the same way: automation.
So we started with real DevOps activities: anything new can be configured manually as a PoC, but it must then be re-implemented in Ansible before we consider the related activity complete.
This brings several advantages: the setup is documented as code, it can be reproduced whenever needed, and the knowledge no longer lives only in the head of whoever performed the original setup.
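As a hypothetical illustration, a step performed by hand during a PoC, say creating a service user and dropping its configuration file, ends up as a couple of Ansible tasks (all names here are made up):

```yaml
# tasks/main.yml -- illustrative tasks captured from a manual PoC step
- name: Create the service user
  ansible.builtin.user:
    name: myservice
    system: true
    shell: /sbin/nologin

- name: Deploy the service configuration
  ansible.builtin.template:
    src: myservice.conf.j2
    dest: /etc/myservice.conf
    owner: myservice
    mode: "0640"
```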
In the effort to standardize our infrastructure, we noticed that many setups recur: for instance, to deploy a reverse proxy we always use the same version of Nginx, open certain ports, and configure SELinux. Once we saw that these activities were common to most of the machines, we moved the tasks from multiple playbooks into a dedicated role, and now we just invoke that role to set everything up.
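A sketch of what the tasks of such a role could contain, assuming a role named `reverse_proxy` (the pinned Nginx version and the SELinux boolean are illustrative, not our real configuration):

```yaml
# roles/reverse_proxy/tasks/main.yml -- illustrative sketch
- name: Install the pinned Nginx version
  ansible.builtin.dnf:
    name: nginx-1.20.1        # hypothetical pinned version
    state: present

- name: Open the HTTP and HTTPS ports in firewalld
  ansible.posix.firewalld:
    service: "{{ item }}"
    permanent: true
    immediate: true
    state: enabled
  loop:
    - http
    - https

- name: Let Nginx connect to its upstream over the network
  ansible.posix.seboolean:
    name: httpd_can_network_connect
    state: true
    persistent: true
```

Every playbook that needs a reverse proxy then just lists `reverse_proxy` in its `roles:` section.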
We do the same thing for several other configurations, like basic software (we want vim, net-tools, tcpdump and so on, on every machine). If something is missing we can add it to the list, and at the next update run the change will be deployed everywhere, installing the required software. Since Ansible ensures idempotency, we can also use these same playbooks to run the periodic updates.
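The common-software setup can then boil down to a single idempotent task over a list, along these lines (the structure is a sketch; only the package names come from the text above):

```yaml
# roles/base_packages/tasks/main.yml -- illustrative sketch
- name: Ensure the basic toolset is installed and up to date
  ansible.builtin.dnf:
    name:
      - vim
      - net-tools
      - tcpdump
    state: latest   # rerunning the play doubles as a periodic update
```

Adding a tool to every machine becomes a one-line change to this list.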
Once these playbooks were available and well tested, we noticed that we still needed a properly set up machine to run them. In practice they might be launched from the latest Fedora, RHEL, or even Windows release, and every time we had to deal with different Ansible and Python versions, dependencies, and so on.
Therefore we decided to move everything possible into a container, which heavily simplified usage. The container works out of the box on every system, the base image can easily be reused, and if something breaks you can just throw it away and start a clean one, without caring about environment setup. All you need is Docker. Furthermore, it’s much easier to integrate a container into a CI process, for example to run periodic tasks.
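A minimal sketch of such an image (the base image, versions, and paths are assumptions, not our actual Dockerfile):

```dockerfile
# Dockerfile -- self-contained Ansible runner, illustrative sketch
FROM fedora:39

# Pin the interpreter and the Ansible version once, instead of fighting
# whatever happens to be installed on each engineer's workstation
RUN dnf install -y python3 python3-pip openssh-clients && \
    pip3 install "ansible-core==2.16.*" && \
    dnf clean all

WORKDIR /ansible
COPY playbooks/ ./playbooks/

ENTRYPOINT ["ansible-playbook"]
```

Running a playbook then becomes something like `docker run --rm -v ~/.ssh:/root/.ssh:ro our-ansible -i inventory playbooks/update-and-reboot.yml` on any host with Docker installed.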
In conclusion, I learned a lot during this year and now understand much better the importance of automation and of the DevOps philosophy. In the near future we’re aiming to move everything possible to OpenShift, including CI, the Docker registry, and our internal repositories, which are all still scattered across many machines. A cluster environment will improve the availability and scalability of our internal services, further reducing the effort needed to handle emergencies.
Did you find this article interesting? Are you an “under the hood” kind of person? We’re really big on automation, and we’re always looking for like-minded people to fill roles just like this one here at Würth Phoenix.