17. 01. 2025 Emil Fazzi Automation, Development, Documentation, Log-SIEM

Elasticsearch Magic: Achieving Zero Downtime during User Guide Updates

In a previous blog post by one of my colleagues, we shared how we developed a powerful semantic search engine for our NetEye User Guide. This solution uses Elasticsearch in combination with machine learning models like ELSER to index and query our documentation. While the proof of concept (POC) worked great, there was a challenge that needed to be tackled before putting it in production: ensuring consistency between the deployed user guide and the search results, all while maintaining zero downtime.

In this post, we’ll dive deeper into how we adapted the POC and solved the challenge of keeping the search engine consistent and operational during a user guide deployment. More specifically, we’ll talk about how we borrowed an idea from smartphone OS updates and applied it to Elasticsearch indices for seamless updates and zero-downtime operations.

The Challenge: Maintaining Consistency During User Guide Updates

As you may know, Elasticsearch is a powerful tool for full-text search, but it does not offer multi-document transactions like those you’d find in traditional relational databases: each document write is atomic on its own, but a batch of writes is not. This posed a significant challenge for our use case.

When the content of our user guide changes, we need to index the new documents in Elasticsearch so that search results reflect the updated content. However, we wanted to guarantee that even in the middle of an update, the search engine would always return consistent results. This means that even if a user is searching the guide while the update is happening, the results should still be valid and match the version of the guide that is currently deployed.

Re-indexing a whole guide in place is not atomic; there’s always a window of time during which the indexed documents are inconsistent with the live content of the user guide. During the update, some users might see a mix of outdated and new search results because the documents are still being indexed. Our goal was clear: ensure consistency and eliminate downtime during the update process.

The Solution: Borrowing from Smartphone OS Updates

In my search for a solution, I was reminded of a technique I had encountered in the past when experimenting with custom smartphone ROMs (and basically messing up both my smartphone warranty and core functionalities). Android OS updates typically use a feature known as the A/B partition system.

The idea behind A/B partitions is simple yet powerful: there are two separate partitions (A and B) that each contain a copy of the system. When an update is released, only one partition is updated, while the other partition continues to run the current version of the system. Once the update is finished, the device switches to the updated partition on the next reboot, minimizing downtime and ensuring that the user can always interact with a stable version of the OS.

I realized this exact same approach could be applied to our Elasticsearch indices. Instead of having a single index that is constantly updated, we could use two indices to mirror the state of the user guide content.

How We Applied the A/B Partitioning Strategy to Elasticsearch

Here’s how we implemented this solution in production:

  1. Two Indices:
    • We created two Elasticsearch indices to store the user guide documents. Let’s call them user-guide-v1 (initially the active index) and user-guide-v2 (initially the staging index).
    • The active index (user-guide-v1) always contains the live content, while the staging index (user-guide-v2) receives the new documents when there’s an update to the guide.
  2. Update Process:
    • Whenever we develop new content for the user guide, we start by indexing it into the staging index (user-guide-v2).
    • During this time, the active index (user-guide-v1) continues to serve search requests without disruption, ensuring consistency in the search results.
  3. Switching the Indices:
    • Once the new documents are fully indexed and the deployment is complete, we switch the Elasticsearch alias that the search server queries so that it points to user-guide-v2 instead of user-guide-v1.
    • This switch happens almost instantaneously and ensures that the search server now queries the newly indexed content, keeping the results consistent with the newly deployed user guide.
  4. Rolling Back:
    • If any issues arise during the update process, we can quickly revert the alias to point back to user-guide-v1, thus switching back to the previous version without any downtime.
    • This approach also gives us the flexibility to perform rollbacks not just for the HTML content of the guide but also for the search data.
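
The switch and the rollback described above can be sketched as a single request to the Elasticsearch `_aliases` endpoint. This is a minimal sketch rather than our production code: the alias name `user-guide` and the helper `build_alias_swap` are assumptions made for illustration, and the snippet only builds the request body without contacting a cluster.

```python
import json

# Hypothetical alias name that the search server would query.
ALIAS = "user-guide"


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body of a POST /_aliases request that moves `alias`.

    Both actions travel in one request, so Elasticsearch applies them
    atomically: the alias never points at neither or at both indices.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Promote the staging index after a successful deployment.
promote = build_alias_swap(ALIAS, "user-guide-v1", "user-guide-v2")

# A rollback is the same request with the two indices exchanged.
rollback = build_alias_swap(ALIAS, "user-guide-v2", "user-guide-v1")

print(json.dumps(promote, indent=2))
```

Sending `promote` as the JSON body of `POST /_aliases` would perform the switch; sending `rollback` would undo it. Because the search server only knows the alias, neither operation requires a change on the search side.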

Ensuring Zero Downtime

By implementing this A/B partitioning approach, we were able to update the user guide content without interrupting the search functionality. Even when the content is being updated in Elasticsearch, users can still perform searches and get consistent results based on the currently active index.

Since an Elasticsearch alias can be switched between indices quickly, this method ensures that the search system remains up and running throughout the update process. Users always get search results corresponding to the currently deployed version of the content, with zero downtime.
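
To make the guarantee concrete, here is a small in-memory model of the approach. It is only an illustration under stated assumptions: `FakeCluster` is a hypothetical stand-in written for this post, not an Elasticsearch client, and the document titles are made up. It shows that searches going through the alias keep returning the old, complete content while the staging index is being filled, and see the new content only after the switch.

```python
class FakeCluster:
    """In-memory stand-in for a cluster: two named indices plus one alias."""

    def __init__(self):
        self.indices = {
            "user-guide-v1": ["page-a (old)", "page-b (old)"],
            "user-guide-v2": [],
        }
        self.alias_target = "user-guide-v1"

    def search(self):
        # Readers only ever go through the alias, never a concrete index.
        return list(self.indices[self.alias_target])

    def staging(self):
        # The staging index is whichever one the alias does not point to.
        return next(name for name in self.indices if name != self.alias_target)

    def swap(self):
        # Point the alias at the staging index; the roles are now reversed.
        self.alias_target = self.staging()


cluster = FakeCluster()
target = cluster.staging()
cluster.indices[target] = []  # start from a clean staging index

seen_during_update = []
for doc in ["page-a (new)", "page-b (new)"]:
    cluster.indices[target].append(doc)          # index one document at a time
    seen_during_update.append(cluster.search())  # a user searches mid-update

cluster.swap()
seen_after_update = cluster.search()

print(seen_during_update)
print(seen_after_update)
```

Note that after the swap `staging()` returns `user-guide-v1`, so in this model the next update would be written to the index that was active before.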

Conclusion

By applying the A/B partitioning strategy from Android OS updates to our Elasticsearch-based search system, we solved the challenge of maintaining consistency between the user guide content and search results. This solution allowed us to deploy updates without causing any downtime or inconsistency, ensuring a seamless experience for our users.

This strategy not only improved our search functionality but also gave us the confidence to continue evolving and deploying new content without worrying about the stability of the search engine. It’s a great example of how taking inspiration from different fields can lead to innovative solutions in software engineering.

Feel free to try out the updated search functionality in our NetEye User Guide and let us know how it works for you!

Emil Fazzi

Software Developer, R&D Team in the "IT System & Service Management Solutions" group at Würth Phoenix.
