At Würth Phoenix, we’re no strangers to the ever-evolving world of technology. As part of our continuous innovation process and culture, we’ve been enhancing our user guide to support Elasticsearch’s ELSER model for semantic search. The goal is to improve the efficiency and accuracy of our searches, powered by machine learning. However, in typical developer fashion, things didn’t quite go as smoothly as we had hoped.
It started with a sporadic failure of our Elasticsearch Machine Learning jobs: after a successful deployment and some initial testing, we began noticing that the jobs would sometimes fail. What initially seemed like a minor glitch turned out to be something much deeper: a mysterious issue that took me down a rabbit hole of debugging and reverse engineering, at the end of which I discovered a hard requirement for AVX2 support. By the way, check out this awesome post to learn more about our improvements to the User Guide Search!
At first, everything seemed fine. We had implemented the user guide update POCs, integrated Elasticsearch’s semantic search with ELSER, and run some basic tests. But then, out of the blue, some nodes started failing intermittently.
After some initial checks and a bit of painful thinking, we restarted some services and did some basic debugging, but, as you may have guessed, the problem persisted. It wasn’t until I dove deeper into the logs that the segmentation fault messages in the dmesg output caught my attention. Now, Seg Faults aren’t exactly uncommon, but the fact that they were showing up consistently across different nodes was a red flag. Something was off.
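For anyone who wants to reproduce that first step, here is a minimal sketch (assuming a Linux host where your user is allowed to read the kernel ring buffer) that filters dmesg for segfault reports; it is an illustration, not the exact command we ran:

```python
import subprocess

# Read the kernel ring buffer; depending on the system this may require
# root privileges or the CAP_SYSLOG capability.
output = subprocess.run(
    ["dmesg"], capture_output=True, text=True, check=True
).stdout

# Kernel segfault reports contain the phrase "segfault at".
segfaults = [line for line in output.splitlines() if "segfault at" in line]

print(f"{len(segfaults)} segfault report(s) found")
for line in segfaults:
    print(line)
```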
I decided to dig into the issue, starting with the basics. I couldn’t let this slip through the cracks, especially considering the Machine Learning job was central to the new features we were rolling out. My first step? Check the stack traces and logs on the affected nodes. What I found was pretty interesting – a consistent Segmentation Fault error that seemed tied to a particular .so
shared library used by Elasticsearch. This file, part of the PyTorch backend, was performing the heavy lifting for some of the computations that ELSER needed for semantic search.
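Conveniently, the kernel’s segfault report already names the mapping that the faulting instruction pointer falls into, so a few lines of parsing are enough to confirm that the crashes keep landing in the same library. The line below is shaped like a typical report (the addresses are invented), and the regex is an assumption you may need to adapt to your kernel version:

```python
import re

# Illustrative dmesg line; the addresses are invented for the example.
line = (
    "pytorch_inference[2171]: segfault at 0 ip 00007f3a1c2b4d10 "
    "sp 00007ffce81d3a60 error 4 in libtorch_cpu.so[7f3a19e00000+5a00000]"
)

pattern = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) .*"
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

match = pattern.search(line)
if match:
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    print(f"faulting library: {match['lib']}")
    print(f"instruction pointer: {ip:#x}")
    print(f"offset into the library: {ip - base:#x}")
```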
After identifying the file, I turned to a tool that’s become my second pair of eyes: Ghidra. Ghidra is an amazing reverse engineering tool, and I wasn’t about to let a Seg Fault stymie my investigation. I opened the libtorch_cpu.so shared library in Ghidra and rebased it to the library’s runtime load address to account for ASLR (Address Space Layout Randomization), so that the faulting address from the logs would point at the right spot in the disassembly. From there, I navigated through the assembly to find the instruction causing the crash.
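The arithmetic behind that rebasing step is straightforward. The sketch below, with invented addresses, shows the two equivalent ways to line things up: rebase the image in Ghidra to the runtime load address and jump straight to the faulting instruction pointer, or keep the image base Ghidra assigned and translate the runtime address into it.

```python
# All addresses here are illustrative, not the ones from the real crash.
runtime_ip   = 0x7F3A1C2B4D10  # faulting instruction pointer from dmesg
runtime_base = 0x7F3A19E00000  # where the loader mapped libtorch_cpu.so (ASLR)
ghidra_base  = 0x00100000      # image base Ghidra assigned to the ELF
                               # (check the Memory Map window for yours)

# Offset of the faulting instruction inside the library itself.
offset = runtime_ip - runtime_base

# Option 1: rebase the program in Ghidra to runtime_base and go to runtime_ip.
# Option 2: keep the assigned base and go to the translated address instead.
ghidra_address = ghidra_base + offset

print(f"offset into libtorch_cpu.so: {offset:#x}")
print(f"address to inspect in Ghidra: {ghidra_address:#x}")
```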
This is where things started getting interesting. As I traced through the code, I reached the piece of code (or rather, assembly) that was throwing the error: the faulting instruction was a VBROADCAST. I pulled up the x86_64 instruction set manual for the details, and what I found was both a relief and a mystery: the instruction was part of the AVX instruction set, but it was only available under specific conditions that I had overlooked at first. I knew my CPU supported AVX, so why was the code failing?
After some more digging, I realized that although the instruction really was introduced with the AVX instruction set, the particular form being executed, broadcasting from a register operand instead of from memory, is only available with the newer AVX2 extension. That’s when the pieces of the puzzle clicked into place. The CPU I was running on supported AVX, sure, but it didn’t support AVX2, and that was the root cause of the issue.
It turns out that Elasticsearch’s ELSER model was relying on this AVX2-only form of the instruction to accelerate some of the more complex computations involved in semantic search. The lack of AVX2 support on certain nodes wasn’t just an inconvenience; it was preventing the machine learning jobs from completing successfully once they were scheduled there.
The realization was a bit of a ‘Eureka!’ moment for me, but it also meant a shift in how we approached the deployment of our new features. It wasn’t just about updating the User Guide and integrating the model; we needed to make sure the underlying hardware actually met the AVX2 requirement!
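If you want to catch this before it bites, a quick look at the CPU flags of every candidate node tells AVX and AVX2 apart. Here is a minimal, Linux-only sketch that reads /proc/cpuinfo; in our setup, a node without the avx2 flag is simply not a place to run the ELSER inference workload:

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the Linux kernel."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX  support:", "yes" if "avx" in flags else "no")
print("AVX2 support:", "yes" if "avx2" in flags else "no")

if "avx2" not in flags:
    print("Don't schedule the ELSER / PyTorch inference jobs on this node.")
```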
As fellow developers, we believe in continuous innovation – and that includes solving problems in new and creative ways. As we continue to enhance our products and services, this experience with Elasticsearch’s ELSER model is a reminder of the importance of diving deep into issues when things go wrong. Whether it’s through debugging, reverse engineering, or a good old-fashioned deep dive into documentation, the journey from problem to solution is where the real innovation happens.
So, to recap: sometimes you need to go beyond the obvious. When things break, dig deep, follow the trail, and don’t be afraid to get your hands dirty. You never know what you might uncover – even if it’s something as low-level as an AVX2 instruction.
Did you find this article interesting? Does it match your skill set? Programming is at the heart of how we develop customized solutions. In fact, we’re currently hiring for roles just like this and others here at Würth Phoenix.