06 Apr 2023

3x Faster MongoDB Controlled Failovers

I recently modified our MongoDB failover protocol at work in a way that reduces the interruption from 14 seconds down to 3.5 seconds, by altering the election configuration ahead of controlled failovers. This was tested on a 3.4 cluster but should hold true through modern versions. Starting in 4.0.2, stepping down the primary triggers an immediate election, so this is less valuable for controlled failovers, but it’s still useful as a tuning setting for uncontrolled failovers.

How it works

The premise is to make the shard’s replica set call a new election as fast as possible by temporarily reducing electionTimeoutMillis and heartbeatIntervalMillis.

Procedure:

// on the current primary
cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 1000
cfg.settings["heartbeatIntervalMillis"] = 100
rs.reconfig(cfg)

// wait 60 seconds for the reconfig to propagate
rs.stepDown()

// wait 60 seconds for the election to settle,
// then connect to the new primary and restore the previous values

cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 10000
cfg.settings["heartbeatIntervalMillis"] = 1000
rs.reconfig(cfg)
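
If you do this regularly, the same procedure is easy to script. Here’s a minimal sketch using pymongo rather than the shell; the connection URI is a placeholder, and the restored values simply match the ones in the procedure above.

# Minimal sketch of the controlled-failover procedure above, via pymongo.
# The URI hosts are placeholders.
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

URI = "mongodb://mongo-0,mongo-1,mongo-2/?replicaSet=rs0"  # placeholder hosts

def set_election_settings(client, election_ms, heartbeat_ms):
    # Fetch the current config, tweak the two settings, and reconfigure.
    cfg = client.admin.command("replSetGetConfig")["config"]
    cfg["settings"]["electionTimeoutMillis"] = election_ms
    cfg["settings"]["heartbeatIntervalMillis"] = heartbeat_ms
    cfg["version"] += 1  # rs.reconfig() does this bump for you in the shell
    client.admin.command("replSetReconfig", cfg)

client = MongoClient(URI)

# 1. Shorten the election timings so the replacement election starts quickly.
set_election_settings(client, 1000, 100)
time.sleep(60)  # let the reconfig propagate

# 2. Step down the primary; the driver then discovers the new primary.
try:
    client.admin.command("replSetStepDown", 60)
except AutoReconnect:
    pass  # the stepping-down primary may close the connection, which raises here
time.sleep(60)  # let the election settle

# 3. Restore the values used before (same as the shell procedure above).
set_election_settings(client, 10000, 1000)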

These settings are also worth tuning permanently if you’re on a high-quality, low-latency network: you’re missing out on faster failovers in uncontrolled circumstances every time mongo insists on waiting the full 10 seconds before allowing an election, even while it’s receiving failing heartbeats.

PS - While browsing the docs I found this ^_^ which is non-intuitive, since I would expect no writes to the affected shard but no impact to the other shards. Presumably it’s a typo and "cluster" here means replicaset.

During the election process, the cluster cannot accept write operations until it elects the new primary.

28 Nov 2021

Mongo Deployment Experiment Using ASG, Consul and Terraform

Introduction

I’m exploring Terraform to deploy MongoDB sharded clusters. In other words, binge-coding Terraform, Packer and Consul turned out to be a delightfully obsessive way to spend the recent holidays, and I’m learning a lot about all three along the way :). Terraform is impressive, and I’m enjoying architecting a robust system from AWS’s building blocks.

Design

The principles this system follows are self-healing and immutability.

Components:

  • Packer - builds server images of the core functionality (mongod shardsvr, mongod configsvr, mongos) on top of a base image of Ubuntu LTS with Consul agents pre-configured.
  • Terraform - deploys, updates and deletes AWS infrastructure:
    • SSH keys
    • Security groups
    • VPCs and Subnets
    • Auto-scale groups + Launch configurations
    • EBS Volumes (boot disks and db data storage)
  • Consul Servers - the 3-5 servers that form the stable quorum of Consul server agents.
  • Consul - each mongo server has Consul set up with auto-join functionality (aka retry_join: ["provider=aws ..."]) based on AWS tagging.
    • Used for dynamic DNS discovery in conjunction with systemd-resolved as a DNS proxy.
    • Will use consul-template to update config files on servers post launch. (I’m hoping this is an elegant solution for ASG booted instances that need final configuration after launch and a way to avoid having to roll the cluster for new configurations.)
  • Auto-scale groups (ASG)
    • Each mongo instance is an auto-scale group of 1.
    • ASG monitors and replaces instances that become unhealthy
  • Auto re-attach of EBS data volume
    • In the event that a mongo instance becomes unhealthy, the ASG replaces the node, but the replacement initially lacks the data-bearing EBS volume.
    • Recreating that volume is prohibitive at large sizes once you factor in the restoration time plus needing to dd the full drive to achieve normal performance.
    • Instead of creating a new volume, a cronjob on each configsvr and shardsvr tries every minute to re-attach the EBS data volume paired to that instance, using metadata from the EC2 instance tags and the EBS volume tags (see the sketch after this list).
    • The cronjob looks up the required metadata and issues the AWS attach-volume call.
    • If the volume is already attached, the attach call is effectively a no-op.
  • EBS Volumes
    • DB Data volumes are separately deployed and persist after separation from their instance.
    • These will be in the terabyte size range.
    • To replace a drive (corruption or performance issues):
      • Provision an additional drive from snapshot using terraform
      • Update the metadata of that shard’s replicaset member to point to the new drive’s name
      • terraform apply
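
The re-attach cronjob referenced above is small; below is a rough sketch of what it could look like in Python with boto3. The pairing tag name (volume-pair-id), the device name, and the use of IMDSv1 instance metadata are assumptions for illustration, not the exact setup described here.

# Hypothetical sketch of the per-instance EBS re-attach cron job using boto3.
# Tag names and the device path are placeholders; any tag shared between the
# instance and its paired data volume would work.
import boto3
import requests

METADATA = "http://169.254.169.254/latest/meta-data"

def instance_identity():
    # Assumes IMDSv1 is enabled; IMDSv2 would need a session token first.
    instance_id = requests.get(f"{METADATA}/instance-id", timeout=2).text
    az = requests.get(f"{METADATA}/placement/availability-zone", timeout=2).text
    return instance_id, az[:-1]  # "us-east-1a" -> "us-east-1"

def main():
    instance_id, region = instance_identity()
    ec2 = boto3.client("ec2", region_name=region)

    # Read the pairing tag from this instance's EC2 tags (hypothetical tag name).
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    pair_id = next(t["Value"] for t in tags if t["Key"] == "volume-pair-id")

    # Find the EBS data volume carrying the same pairing tag.
    volume = ec2.describe_volumes(
        Filters=[{"Name": "tag:volume-pair-id", "Values": [pair_id]}]
    )["Volumes"][0]

    # If it's already attached (the normal case), this run is a no-op.
    if volume["Attachments"]:
        return

    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId=instance_id,
        Device="/dev/sdf",  # placeholder device name
    )

if __name__ == "__main__":
    main()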

My next steps are to automate the replicaset bonding and then the shard joining. The open source tooling for this portion isn’t what I want; the closest is mongo ansible, which is an established tool, but I want something more declarative with a simpler model of what it will do when executed. As a result, the answer might be a custom Terraform provider to manage the internal configuration state of MongoDB. Philosophically, Terraform’s CRUD resource management and plan/apply phases match what will give me confidence using this on production clusters.
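
The pieces that need automating are themselves only a couple of commands; here’s a rough pymongo sketch of what a provider would ultimately have to drive. All hostnames, ports, and names are placeholders.

# Rough sketch of the replica set bonding / shard joining commands that the
# automation would wrap. Hostnames, ports, and set names are placeholders.
from pymongo import MongoClient

# 1. Initiate one shard's replica set from one of its members.
shard = MongoClient("mongodb://shard0-a:27018/?directConnection=true")
shard.admin.command("replSetInitiate", {
    "_id": "shard0",
    "members": [
        {"_id": 0, "host": "shard0-a:27018"},
        {"_id": 1, "host": "shard0-b:27018"},
        {"_id": 2, "host": "shard0-c:27018"},
    ],
})

# 2. Register the replica set as a shard through mongos.
mongos = MongoClient("mongodb://mongos-0:27017")
mongos.admin.command(
    "addShard", "shard0/shard0-a:27018,shard0-b:27018,shard0-c:27018"
)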

I’ll open source the work if it gets to a mature spot. Right now the terraforming successfully spins up all the mongod nodes, networking, VPCs, security groups, EC2 instances, and EBS volumes, and the nodes auto-join their Consul cluster.

Credit for the concept of this approach belongs to multiple different blog posts, but the original idea of ASG + EBS re-attaching came from reading about how Expedia operates their sharded clusters. Thanks!