05 Mar 2026

Surviving Database Migrations: Ambitious and a little crazy

At Plaid, we just finished a two-and-a-half-year project to migrate 234 databases across 100 services from AWS Aurora MySQL to self-hosted TiDB.

We’ve been asked many times by colleagues in peer organizations and by our vendor what enabled us to achieve this migration with high velocity and high correctness, without a commensurately high headcount or a mature team.

The following is a treatise to answer that question from our lived experience.

Where it began

Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning. - Churchill

At the end of 2021, I was a team manager for a platform team that depended heavily on a NoSQL data store that was unruly, unmaintained, and massive. We considered managed services, but decided it was cheaper to build out our own team, and we’ve been right about that!

Due to repeated outages, I secured headcount by the end of the year to build out an online storage team. In 2022, I was the director for the core services area of the platform, which covered my original team, a sibling team, and the newly formed but empty online storage team.

I poached our founding engineer from my division’s hiring pipeline because of his resume and his obsession with databases. I found out later he would never have accepted an offer for the core services area that I led, except for the pitch I made about founding our online storage team.

Hiring Mingjian Liu was a great start. Now we had one talented engineer to run the database servers for an $8 billion company and support 350 software engineers… This was an improvement!

Next, he and I took alternating weeks of responsibility for level one (L1) on-call for the most reliability-critical data stores at Plaid. On our off weeks, we were L2 on-call.

At this stage, we were tamping down the bigger fires in data store reliability, things like:

  • building observability (thanks PMM!)
  • standardizing alerts
  • advanced tunings
  • working with client teams on group commits
  • workload separation
  • dodging between outages to try to get back to our planned work

By “we”, I mean my founding engineer, because I was a director for the division, in wall-to-wall meetings and moonlighting as a software engineer.

We earned a lot of goodwill fighting the great Winter of NoSQL outages and putting the org back on track. Then we took over ownership of the MySQL 5.6 to 5.7 upgrades after learning that AWS Aurora blue/green deploys caused a 30-minute outage. By this time, we had two more engineers on the team, and we were building momentum.

At points when we caught our breath in the upgrades to 5.7, we debated our longer term strategy: do we follow the default path and upgrade to 8.0 or make a bet on a better future with an alternate technology?

Team Building

You go to war with the army you have, not the army you might want or wish to have at a later time. - Rumsfeld

Hiring is hard, and hiring fast is harder, but there’s a cheat code here: you do not need deep expertise in a given domain across the whole team if you have good leadership and technical depth in a few members.

I made the choice that we would prioritize ownership, operational excellence, and general platform experience rather than requiring deep database expertise in our team.

This required some trade-offs… It meant that a lot of the initial design work rested on the shoulders of our founding engineer and myself. It required us to be better at scoping the effort and at mentoring and coaching others to grow along the way.

Looking back from 2026, it means we have a crew of people on the team who were incredible contributors, who grew through the process, and who made it possible. Their lack of database pedigree was not an impediment.

One of them stepped into the role of team tech lead. Another one stepped into the role of engineering manager for the storage team. Our attrition was very low during this project arc because of meaningful work, camaraderie and good morale.

Resource constraints drive invention

Difficulties mastered are opportunities won. - Churchill

Now we had a team and we’d faced our first few battles and survived.

With the prospect of the MySQL 8.0 upgrades, we veered hard off-road and made a calculated bet:

We can deliver this project to the organization to improve the life of every developer at Plaid and for our customers, with a 6 person team largely missing deep database expertise, in an amount of time that stood up to business scrutiny.

How’s that for ambitious and a little bit crazy?

If we had taken a standard approach, or if we hadn’t had our team buy-in and leadership buy-in, we would have failed. I think of this category as “default fail” projects.

Converting “default fail” into “will succeed”

Business Constraints

To varying degrees, we nerd out about perfection when operating at the data store layer:

  • It may be perfection of read-your-writes consistency.
  • It may be perfection of snapshot isolation and phantom-read prevention.
  • It may be perfection of steady p999 latency or achieving greater than five nines of availability.

But the business doesn’t give a flying fuck about perfection and that’s the correct fiscal invariant!

The business cares about money and what drives growth. Datastores, reliability, consistency and disaster recovery are implementation details in service of the customers and shareholders.

The business wants what is good enough for the circumstances the business operates in… and that’s what we needed to target, after accounting for a safety margin.

So we set out to categorize our services and our business constraints to find the least common denominator: something that would exceed their tolerances but not cost an arm and a leg to build. Then we reviewed those business targets with engineering leadership to bucket our services into criticality tiers and archetypes. This allowed us to make key decisions once rather than re-litigate them for every service.
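Bucketing like this can be sketched as a small typed config. This is a hedged illustration of the idea, not Plaid’s actual schema: the tier names, archetypes, and fields below are all assumptions.

```typescript
// Illustrative sketch: bucket services by criticality and archetype so that
// cutover decisions are made once per bucket, not re-litigated per service.
type Tier = "T0" | "T1" | "T2";
type Archetype = "financial-ledger" | "metadata" | "cache-backed";

interface ServiceProfile {
  name: string;
  tier: Tier;
  archetype: Archetype;
  maxWriterDowntimeSec: number; // business-approved cutover budget
}

// Hypothetical example services.
const profiles: ServiceProfile[] = [
  { name: "ledger", tier: "T0", archetype: "financial-ledger", maxWriterDowntimeSec: 120 },
  { name: "session-cache", tier: "T2", archetype: "cache-backed", maxWriterDowntimeSec: 60 },
];

// One decision per bucket: does the standard 60-120s cutover fit the budget?
function standardCutoverAllowed(p: ServiceProfile): boolean {
  return p.maxWriterDowntimeSec >= 120;
}
```

The payoff is that a new service only needs to be classified, and every downstream decision falls out of its bucket.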

A tactical example here was our decision to do controlled cutovers via feature flags with 60 to 120 seconds of writer downtime when cutting from Aurora to TiDB. We spec’d out a zero-downtime cutover process as the alternative, but realized that the effort and complexity of getting the zero-downtime solution correct, combined with the limited cases in which we would accept that kind of consistency risk, were not worth the investment.

We prioritized consistency guarantees for our critical financial data, with the acceptable error budget loss of 60 to 120 seconds of downtime on the write pathway.
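The controlled cutover above can be sketched as a small orchestration loop: pause writes behind a flag, wait for the target to catch up, flip the backend flag, resume writes. This is a minimal sketch under assumed interfaces — `FlagStore`, `ReplicationChecker`, and every name here are illustrative, not Plaid’s actual system.

```typescript
// Hypothetical feature-flag-gated cutover with a bounded writer-downtime budget.
type Backend = "aurora" | "tidb";

interface FlagStore {
  setWritesPaused(service: string, paused: boolean): void;
  setBackend(service: string, backend: Backend): void;
}

interface ReplicationChecker {
  // Returns true once the target has applied all source writes.
  isCaughtUp(service: string): boolean;
}

// Pause writes, drain replication, flip the backend, resume writes.
// If the budget is exceeded, abort so the service stays on the old backend.
function cutover(
  service: string,
  flags: FlagStore,
  repl: ReplicationChecker,
  timeoutMs = 120_000,
): { downtimeMs: number } {
  const start = Date.now();
  flags.setWritesPaused(service, true); // writer downtime begins
  try {
    while (!repl.isCaughtUp(service)) {
      if (Date.now() - start > timeoutMs) {
        throw new Error(`cutover for ${service} exceeded ${timeoutMs}ms budget`);
      }
    }
    flags.setBackend(service, "tidb"); // reads and writes now target TiDB
  } finally {
    flags.setWritesPaused(service, false); // writer downtime ends
  }
  return { downtimeMs: Date.now() - start };
}
```

The `finally` block is the important design choice: whether the cutover succeeds or aborts, writes resume, so the failure mode is “stayed on Aurora” rather than “stuck paused.”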

Paying those losses was no worse than what we would have paid had we upgraded to MySQL 8.0, but they transported us to a platform where we’ve had zero downtime across three upgrades, all clusters, and 100 services in 2025.

Better every time

If you are going through hell, keep going. - Churchill

Migrations suck, so we made them suck less. - Zander

Empower your team and insist that they make their process better every time they do it. Whether that’s:

  • contributing back documentation on a runbook
  • contributing back reliability risks from a collation
  • adding a tiny bit of automation to a set of Terraform changes

These trivial things compound and become superpowers.

Don’t start out by boiling the ocean on automation. Don’t bludgeon it with waterfall design until you run out of budget and get fired.

Start small. Start manual. Empower your team so that every time they do something annoying, repetitive, or risky, they contribute back a fix.

We started with engineering logs that were just scratch notes of all of the steps we were taking. After you do that a time or two, you know what’s repeated and you dread it.

Imagine a step that takes 4 hours of human time and that you’re going to have to do another 234 times over two and a half years; you polish that into a runbook that lives in Google Docs.

Over a couple of months, you have 40 of those runbooks for different steps of the process, all contributed by different members of the team and cross-reviewed for simplicity and precision. Now you need an index for your runbooks, which you call your master runbook. Every time someone starts a new service migration, they copy the master runbook, and that becomes their engineering journal for ensuring they complete every step perfectly.

Brute-force copy/pasting works for the first 10 or 20 of the 100 services you’re going to migrate, but then the slowness and the monotony and the mistakes creep in.

So I built a system for us called dynamic runbooks: glorified Jupyter notebooks that run the Deno runtime, chosen for our team’s familiarity and for its security stance. These dynamic runbooks substantially replace the operational steps from the Google Docs. Instead of copying and pasting, we now run a Jupyter notebook for a specific cluster and service and a specific set of steps.
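A dynamic runbook cell boils down to a parameterized step: the operator supplies the cluster and service once, and every step executes against them instead of against hand-edited placeholders. Here is a minimal sketch of that shape; `RunbookContext`, `Step`, and the step itself are hypothetical names, not the real SDK.

```typescript
// Illustrative shape of one dynamic-runbook step, parameterized by target.
interface RunbookContext {
  cluster: string;
  service: string;
}

interface Step {
  title: string;
  run(ctx: RunbookContext): string; // returns a human-readable result line
}

// A notebook cell executes one step against a concrete cluster/service,
// replacing the old "edit the Google Doc placeholders" copy/paste flow.
const verifyReplicaLag: Step = {
  title: "Verify replica lag is under threshold",
  run: (ctx) => `checked replica lag for ${ctx.service} on ${ctx.cluster}`,
};

// Every step's output is recorded into the migration's engineering journal.
function runStep(step: Step, ctx: RunbookContext): string {
  return `[${step.title}] ${step.run(ctx)}`;
}
```

Because the cell is code rather than prose, it cannot silently drift from the procedure it documents.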

Pointing and clicking > Copy and pasting - Zander

We’re pointing and clicking our way to success.

It’s faster, less error-prone, and self-describing, and it cannot drift out of date. It continues to empower our engineers to contribute back to the process and automate what we’re doing. It transforms frustrations into frustration-driven developments.

These dynamic runbooks started very basic. They were nearly a copy-and-paste job of the Google Docs, but then we started wrapping common behavior, like specific commands or API executions, into a Deno SDK that we would import and use in the runbooks. We set up an asynchronous job runner when some tasks started to need to run uninterrupted for over 24 hours. That job runner also gave us easy streaming log access for tasks, so that I could check on a colleague’s tasks if something was happening in our validation cluster.
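The job runner’s two properties — long-running tasks decoupled from any one operator’s session, plus streaming log access for teammates — can be sketched in a few lines. This is a single-process toy under assumed names (`JobRunner`, `tail`), not the production design.

```typescript
// Hypothetical async job runner with streaming log access by job id.
type JobFn = (log: (line: string) => void) => Promise<void>;

class JobRunner {
  private logs = new Map<string, string[]>();
  private jobs = new Map<string, Promise<void>>();

  // Start a background job; its state lives in the runner, so a 24h+ task
  // survives the operator closing their notebook.
  start(id: string, fn: JobFn): void {
    const buf: string[] = [];
    this.logs.set(id, buf);
    this.jobs.set(id, fn((line) => buf.push(line)));
  }

  // Any teammate can tail a colleague's job by id while it runs.
  tail(id: string): string[] {
    return [...(this.logs.get(id) ?? [])];
  }

  async wait(id: string): Promise<void> {
    await this.jobs.get(id);
  }
}
```

The real system would persist logs and job state outside the process; the point is the interface, where a job id is all you need to observe someone else’s migration task.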

Cross Functional Alignment

You need the goodwill of the organization for 2.5 years, and goodwill does not last that long. You need to fight entropy and demonstrate wins along the way while having little to no attrition.

Every service you move forward to the next platform should be able to demonstrate improvement. It may not be perfect, but you need the teams who operate those services, and their managers, to gradually have better experiences with the data store layer. Frankly, if you lack that, it’s not an alignment issue; it’s a failure to deliver on promises.

Building the dream, in-flight

Envisioning a future rarely fits neatly into OKRs.

Build the dream in the terms of the value that it provides to the organization as a whole, and then figure out how you can represent that with meaningful metrics. There are many parts of the project early on that will be ambiguous, but make sure you have a north star of specific pillars that you’re delivering on for how it impacts the organization as a whole. For us, it was:

  • reliability
  • reducing KTLO
  • developer velocity

Make it simple to understand progress and report it outwards: simple for your team, simple for you, and simple for leadership.

Frontload Risks

It is no use saying, ‘We are doing our best.’ You have got to succeed in doing what is necessary. - Churchill

You want to build as rapidly as is safely possible toward the hardest challenge in the arc of this epic journey.

We started with a few Tier 2 services, then a Tier 1, and quickly moved to our first Tier 0: a major driver for the re-platforming. Don’t take your biggest service first, but tackle it within the project’s first 50%; we did our first Tier 1 within 25% and our Tier 0 within 40%. With every service and every latency blip early in our adoption, we were thinking about how to de-risk the Tier 0.

Frontloading risks and climbing the highest peak early improves project success. If you learn about critical flaws early in a safe way, that’s also a win.

If we can’t meet the needs of the T0, we don’t deserve to be running this platform for the organization. - Zander

Championship Team

Success is not final, failure is not fatal: it is the courage to continue that counts. - Churchill

Your team needs a prior track record of success before any business leader will pay for six FAANG-caliber software engineers to work on a project for two and a half years. It’s expensive, and it bloody well has to succeed.

As the leader, it is your job to make sure it succeeds. There is no obstacle that is an acceptable blocker. It’s your reputation on the line and your promise to the organization.

Conclusion

In summary, if you follow these six “easy” steps, you too can move 234 databases in 2.5 years with six software engineers at a high level of reliability and correctness.

We think these are the secret sauce that made it possible.