How To Recover from a DRBD9 Metadata Split Brain Situation
As soon as you manage more than a few DRBD resources distributed across a wide set of hardware, split brain situations can no longer always be avoided. A standard split brain occurs when multiple nodes have different opinions about the latest state of the data on their local disks.
Disclaimer: If applied incorrectly, the commands in this blog post can cause data loss. If you are unsure about any step here, please back up your data first.
Standard problematic cluster situations are commonly resolved by disconnecting the faulty node, discarding its data, and then reconnecting it to the primary node, as outlined below.
Standard Split Brain
In the following example we will assume a two-node setup, with primary.example.com being the node holding the good data in the Primary role, and faulty.example.com being the node which needs to be “fixed”.
Before proceeding with either procedure, please make sure that your primary node contains the copy of the data you want to keep!
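The exact commands depend on your setup, but the following is a minimal sketch of the commonly documented drbdadm sequence, assuming a single resource named r0 (substitute your own resource name):

    # On faulty.example.com: drop the local changes and rejoin the cluster
    drbdadm disconnect r0
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # On primary.example.com: reconnect, if the resource is not already waiting for its peer
    drbdadm connect r0

After this, the faulty node should resynchronize from primary.example.com and eventually report its disk as UpToDate again.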
This is the standard procedure; however, it does not resolve the split brain in all cases. Sometimes the so-called “metadata”, which DRBD uses to keep track of its own actions, gets corrupted.
Metadata Split Brain
If you find yourself in the situation that, after following the procedure above, the disk is still reported as “Inconsistent”, or the connection between the nodes never advances beyond the “Connecting” state, you are most probably a victim of metadata corruption.
In this case you’ll need to invalidate the DRBD resource on the “faulty” node. This recreates the metadata from scratch and overwrites the local data with the data from its peers, so that after the procedure the metadata once again reflects the state of the local disk and of the remote nodes, and the resource can synchronize normally.
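As a hedged sketch rather than a verbatim recipe, and again assuming a resource named r0 that is Secondary on faulty.example.com and can briefly be taken offline, the recovery could look roughly like this:

    # On faulty.example.com: stop the resource and recreate its metadata
    drbdadm down r0
    drbdadm create-md r0     # may ask for confirmation before overwriting existing metadata
    drbdadm up r0

    # Mark the local data as outdated so that DRBD performs a full resync from the peer
    drbdadm invalidate r0

Once the full resync has completed, the resource should report UpToDate on both nodes.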
As mentioned in earlier posts, I strongly suggest using the drbdtop tool, available from the neteye-extras repository. You can use it to supervise and analyze the progress and state of your DRBD resources both during and after the recovery process.
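For example, to keep an eye on the resync you could run drbdtop for an interactive overview, or simply poll the status of the resource (here again assumed to be named r0):

    # Interactive overview of all DRBD resources
    drbdtop

    # Point-in-time status of a single resource, refreshed every two seconds
    watch drbdadm status r0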
Author
Benjamin Gröber
Hi, my name is Benjamin, and I'm a Software Architect in the Research & Development Team of the "IT System & Service Management Solutions" Business Unit of Würth Phoenix.
I discovered my passion for computers and technology when I was 7 and got my first PC. Just using computers and playing games was never enough for me, so just a few months later I started learning Visual Basic and entered the world of software development. Since then, my passion has been keeping up with the short-lived, fast-paced, ever-evolving IT world and exploring new technologies, eventually trying to put them to good use. I'm a strong advocate for writing maintainable software, and lately I've been investing most of my free time in exploring the emerging Rust programming language.