We had a couple of incidents in recent months where, in a 4-node MongoDB replica set, none of the SECONDARY nodes took over as PRIMARY when the PRIMARY went down; even though the remaining SECONDARY nodes could reach each other and vote in a new PRIMARY election, with appropriate priorities set.
Here is one of the simplest 4-node clusters where this happened:
- HOST1, PRIMARY
- HOST2, SECONDARY
- HOST3, SECONDARY
- HOST4, SECONDARY, PRIORITY = 0, HIDDEN = TRUE (in a different data center, mainly for data center redundancy)
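For reference, here is a minimal sketch of how such a topology might be set up from the mongo shell (host names, port, and the replica set name "rs0" are illustrative, not our actual settings):

    // run once, on the node intended to become the first PRIMARY
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "HOST1:27017", priority: 2 },                // preferred PRIMARY
        { _id: 1, host: "HOST2:27017", priority: 1 },                // SECONDARY
        { _id: 2, host: "HOST3:27017", priority: 1 },                // SECONDARY
        { _id: 3, host: "HOST4:27017", priority: 0, hidden: true }   // remote DC copy, never elected
      ]
    })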
CONDITIONS THAT COULD TRIGGER:
A few conditions that could trigger this situation:
- When the PRIMARY is down and none of the nodes can connect to any other node in the cluster
- When a network switch or rack goes down and the majority of the nodes go offline
- When all the nodes are still connected, can vote, and are able to elect a PRIMARY; but the election still fails anyway (a bug)
- When the majority of the nodes are down (2/3rds of them)
We had case 2 and case 3 incidents on production clusters.
FROM A DESIGN POINT OF VIEW:
When there is no acting PRIMARY node in the cluster, the whole cluster is essentially unusable, as you can't do much with it at that point. The cluster will not accept any writes, nor will it let you force a new PRIMARY election, since in both cases a PRIMARY must be up and running.
From a design point of view it makes sense not to let anyone force a new PRIMARY when there is no running PRIMARY, as that protects data consistency; but it makes it hard to recover from production outages once the cluster gets into this situation.
I wish 10gen/MongoDB would relax this constraint for the clusterAdmin role to allow adjusting votes for self-election, or would allow a forced PRIMARY election.
HOW TO AVOID THIS BEFOREHAND:
In order to avoid this situation beforehand, it's sometimes better to give more votes to the individual servers that can be your potential PRIMARY (make sure to set this while the PRIMARY is still active, when the cluster is completely functional, or during the initial replica set setup), as sketched below. Also, one should consider a data center rack- and switch-aware setup, so that a majority stays online when you lose a rack or a switch.
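A rough sketch of adjusting votes from the mongo shell (member indexes and values are illustrative; also note that the MongoDB versions this post is based on allowed more than one vote per member, while newer releases restrict votes to 0 or 1):

    // on the current PRIMARY, while the cluster is still healthy
    cfg = rs.conf()
    cfg.members[0].votes = 2   // extra vote for HOST1
    cfg.members[1].votes = 2   // extra vote for HOST2
    rs.reconfig(cfg)
    rs.conf()                  // verify the new vote counts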
The best source of information is to grep for rsHealthPoll and rsMgr messages in the MongoDB log file, look for messages like the ones below, and take appropriate action immediately by adjusting the replica set votes:
- replSet can’t see a majority, will not try to elect self
- replSet total number of votes is even - add arbiter or give one member an extra vote
- not electing self, XXXX would veto
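For example, assuming the default log location /var/log/mongodb/mongod.log (adjust the path for your installation):

    # pull recent replica set health and election messages from the mongod log
    grep -E "rsHealthPoll|rsMgr" /var/log/mongodb/mongod.log | tail -n 50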
GETTING AROUND WHEN PRIMARY IS DOWN:
But when this happens, the only way one can bail out of this situation is either by bringing the PRIMARY node back online, or by clearing the current replica set, manually making one node the PRIMARY first, and then freshly joining the rest of the nodes as outlined below; this is messy on high-load production environments and unacceptable at times, as it causes production downtime.
Here are the simple steps to reset the cluster (a rough command sketch follows the list):
- Manually pick a node which is up to date and has the latest oplog timestamp (check with db.printReplicationInfo()), let's say HOST2 in this case
- Stop the MongoDB service on all nodes except HOST2
- On HOST2, drop the local database. This is needed to clear the replica set information; MongoDB won't let you delete individual system collections, but it does let you drop the system database (in this case, local).
- Now restart the instance and re-initiate the replica set freshly on HOST2 using rs.initiate() (don't drop any data files or databases on this host). If needed, change the replica set name.
- Restart the rest of the MongoDB instances (HOST3, HOST4, ..) either after deleting their whole data set (completely getting rid of the MongoDB data directory) or after manually syncing the data files. Be careful when deleting these databases: ensure all databases and collections exist on the new PRIMARY before deleting, or copy the old data directory to a backup location; in the worst case you can then recover if you lose the PRIMARY during the initial sync.
- Now add these nodes (HOST3, HOST4, ..) to the cluster using rs.add() and make sure all nodes catch up without any errors (check with rs.status()).
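A rough command-level sketch of the steps above (service names, ports, and the data directory /var/lib/mongodb are assumptions; adapt them to your environment and take backups before any destructive step):

    # 1. stop mongod everywhere except HOST2 (service name may differ per distro)
    sudo service mongod stop

    # 2. on HOST2: drop the local database to wipe the old replica set config
    #    (on some versions you may first need to restart mongod without --replSet)
    mongo
    > use local
    > db.dropDatabase()

    # 3. restart HOST2 with its --replSet option and initiate a fresh replica set;
    #    existing data files and databases are kept
    sudo service mongod restart
    mongo
    > rs.initiate()

    # 4. on HOST3, HOST4, ..: move the old data directory aside and start empty
    sudo service mongod stop
    sudo mv /var/lib/mongodb /var/lib/mongodb.bak        # keep a copy, just in case
    sudo mkdir /var/lib/mongodb && sudo chown mongodb: /var/lib/mongodb
    sudo service mongod start

    # 5. back on HOST2 (the new PRIMARY): add the members and watch the initial sync
    mongo
    > rs.add("HOST3:27017")
    > rs.add("HOST4:27017")   // reapply HOST4's priority 0 / hidden settings via rs.reconfig() afterwards
    > rs.status()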
There is another alternative that involves tweaking the system collections and restarting all secondary nodes; for safety reasons I will not discuss it in this scope.
Hope it helps.