I recently modified our MongoDB failover protocol at work in a way that reduces the interruption from 14 seconds to 3.5 seconds by altering the election configuration ahead of controlled failovers. This was tested on a 3.4 cluster but should hold for later versions as well. Starting in 4.0.2 it’s less valuable for controlled failovers but still useful as a tuning setting for uncontrolled failovers.
How it works
// on the primary
cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 1000
cfg.settings["heartbeatIntervalMillis"] = 100
rs.reconfig(cfg)
// wait 60 seconds for propagation
rs.stepDown()
// wait for 60 seconds for election to settle
// connect to primary
cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 10000
cfg.settings["heartbeatIntervalMillis"] = 1000
rs.reconfig(cfg)
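The fixed 60 second wait for propagation could probably be replaced by a check that every member has picked up the new config. The following is a minimal sketch, not something I have run in production: the helper name configPropagated is made up for illustration, and it assumes each entry in rs.status().members carries a configVersion field, which is worth verifying on your MongoDB version before relying on it.

```javascript
// Hypothetical helper: returns true when every member in an
// rs.status()-shaped document reports the expected config version.
// Assumption: each member entry has a numeric `configVersion` field.
function configPropagated(status, expectedVersion) {
  return status.members.every(function (member) {
    return member.configVersion === expectedVersion;
  });
}

// Possible usage in the mongo shell, right after rs.reconfig(cfg):
//   var version = rs.conf().version;
//   while (!configPropagated(rs.status(), version)) { sleep(1000); }
//   rs.stepDown();
```

A member that is down or lagging will keep this returning false, so a real script would likely want an upper bound on the number of retries.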
This is also worth tuning if you’re on a high-quality, low-latency network. By default mongo waits 10 seconds (electionTimeoutMillis) before allowing an election, even when heartbeats are already failing, so you’re missing out on faster failovers in uncontrolled circumstances.
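If you want to apply tuned values permanently, one option is a small helper that builds the new config without touching the object returned by rs.conf(). This is only a sketch: withElectionSettings is a hypothetical name, and the values in the usage comment are the ones from the controlled failover above, not a recommendation for your network.

```javascript
// Hypothetical helper: returns a shallow copy of a replica set config
// with the two election-related settings overridden. All other settings
// and the original config object are left unchanged.
function withElectionSettings(cfg, electionTimeoutMillis, heartbeatIntervalMillis) {
  var next = Object.assign({}, cfg);
  next.settings = Object.assign({}, cfg.settings || {}, {
    electionTimeoutMillis: electionTimeoutMillis,
    heartbeatIntervalMillis: heartbeatIntervalMillis
  });
  return next;
}

// Possible usage in the mongo shell, on the primary:
//   rs.reconfig(withElectionSettings(rs.conf(), 1000, 100))
```

Setting the timeout too low on a flaky network risks elections triggered by a brief blip, so it seems sensible to lower it in steps and watch for spurious failovers.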
PS - While browsing the docs I found this ^_^ which is non-intuitive: I would expect writes to be blocked on one shard, with no impact on the other shards. Presumably it’s a typo and
During the election process, the cluster cannot accept write operations until it elects the new primary.