24 Oct 2024

Database Best Practices

These are the best practices for storage design and reliability, drawn from a software engineering career of over a decade and from going deep on this topic over the last 4 years.

  1. Write throughput is harder to scale than read throughput
  2. Improve reliability and simplify operational changes by segregating your data by Tier or Scale. Implementing this pushed our DB reliability from ~3.5x 9s to 4.5x 9s
  3. Start with MySQL (recommended) or PostgreSQL (acceptable) for most workloads
  4. Use hosted offerings (AWS Aurora MySQL) until the downsides are overwhelming (N years)
  5. Hosted databases that offer 99.99% availability are overstating it: that figure excludes upgrades, config changes, vertically scaling the writer, etc. Do one of those things per month, assume you have only one DB backing your service, and you’re dangerously close to breaching 4x 9s. So plan on a hosted database being unable to offer more than ~3.5 9s.
  6. If you want a genuine 99.99% or 99.999%, use DynamoDB or DynamoDB global tables, because unlike AWS Aurora, DynamoDB is a fully managed service without those drawbacks
  7. Control all schema changes as part of your SDLC (e.g. commit your migration files and have a CI job plus goose or Flyway execute them); see the migration-runner sketch after this list
  8. Enforce your schemas in the database (RDBMS), or in a single domain-modeling service that is the only client of your database and enforces them at the serialization/deserialization boundary (Mongo), or try Mongo’s newer in-database schema validation. Doing otherwise will lead to higher defect rates.
  9. Emit and record DB driver metrics client-side, and lower-level DB metrics server-side with something like PMM; see the driver-stats sketch after this list
  10. Be proficient at forcing failovers; you’ll need to do it occasionally in RDBMSes or Mongo. Make it a CLI or an automated process driven by gray-failure characteristics and some pre-checks.
  11. Controversial opinion: prefer MySQL over PostgreSQL for more advanced usage and its operational advantages (see Uber’s article on switching)
  12. Sources of reliability risk: schema changes, cluster saturation, hardware failures, query plan changes. Each has some amount of mitigation to lessen its frequency or impact.
  13. Ideal caching-primitive characteristics: request coalescing (singleflight), serving stale data, dynamic scaling, storage in a tiered system of RAM + NVMe, avoidance of the cold-cache problem, built-in sharding, and a TTL on every record to avoid zombie records. A singleflight sketch follows this list.
  14. Online storage platform triad: RDBMS, NoSQL, K/V store/cache
  15. One day you’ll need sharding, but in 2024, use established architectures and avoid reinventing the wheel. You can scale vertically, then partition vertically, to forestall it. Read about Spanner’s origins before you decide to build your own sharding layer, and consider how you’ll reshard, avoid hotspots, choose shard keys for every table, avoid fan-out queries, and store metadata
  16. On day 1 you’ll need connection pooling, so get proficient with tuning it; fewer connections can mean more throughput. See the pool-tuning sketch after this list.
  17. One day you may want ProxySQL to handle connections and routing of traffic
  18. Alerting: use low-priority alerts for things you’ll handle in current or next business hours. Use high-priority alerts to get an operator online 24/7 to intervene within 5 minutes of the signal being emitted.
  19. Alerts must link to actionable runbook steps. Graduate to CLI commands or dynamic runbooks, followed by fully automated responses.
  20. Scale as far as you can vertically and, if possible, use caching. Then start scaling horizontally.
  21. Metrics in aggregate will lie to you, especially AWS dashboards with 5-minute aggregation of averages. You often want 1-minute resolution and maximums or minimums to catch micro-latency spikes in EBS
  22. Use EBS for operational simplicity and durability; use NVMe drives for performance and better cost
  23. Plan your Business Continuity / Disaster Recovery and know your SLA/SLO/SLIs for the systems your team/division/platform runs.
  24. Databases are some of your top reliability risks, especially when immature. Use reader instances, and if you need availability, be willing to give up consistency
  25. Replatforming takes months to years, so know your scaling runway (disk/architecture/throughput) and plan for a 2-5 year horizon of predictable growth
  26. If everything’s on fire, get your systems ‘stable enough’ to buy time for making bigger investments
  27. Any data that’s 10x to 100x the size of its peer data should live in a separate cluster or partition of a cluster, especially when getting into tens to hundreds of TB.
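
For item 7, here’s a minimal sketch of a CI-invoked migration runner, assuming goose’s Go API (github.com/pressly/goose/v3) and a MySQL DSN supplied via an environment variable; the directory layout and variable name are illustrative.

```go
// cmd/migrate/main.go — applies committed schema migrations as a CI step.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql"
	"github.com/pressly/goose/v3"
)

func main() {
	// e.g. "user:pass@tcp(db-host:3306)/app?parseTime=true"
	db, err := sql.Open("mysql", os.Getenv("DB_DSN"))
	if err != nil {
		log.Fatalf("open db: %v", err)
	}
	defer db.Close()

	if err := goose.SetDialect("mysql"); err != nil {
		log.Fatalf("set dialect: %v", err)
	}
	// Applies every not-yet-applied file in ./migrations in order and records
	// progress in goose's version table, so reruns are idempotent.
	if err := goose.Up(db, "migrations"); err != nil {
		log.Fatalf("migrate: %v", err)
	}
	log.Println("schema is up to date")
}
```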
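
For item 9, a sketch of the client-side driver stats: exporting Go’s database/sql pool statistics on a timer. The emitGauge helper and metric names are stand-ins for whatever metrics client you actually use.

```go
package dbmetrics

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// emitGauge is a placeholder for your metrics client (statsd, Prometheus, ...).
func emitGauge(name string, value float64) {
	log.Printf("gauge %s=%g", name, value)
}

// ReportDBStats periodically snapshots the driver's pool stats so you can see
// connection saturation and wait time from the client's point of view.
func ReportDBStats(ctx context.Context, db *sql.DB, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			s := db.Stats()
			emitGauge("db.pool.open", float64(s.OpenConnections))
			emitGauge("db.pool.in_use", float64(s.InUse))
			emitGauge("db.pool.idle", float64(s.Idle))
			emitGauge("db.pool.wait_count", float64(s.WaitCount))
			emitGauge("db.pool.wait_ms", float64(s.WaitDuration.Milliseconds()))
		}
	}
}
```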
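
For item 13, a minimal singleflight sketch covering two of those characteristics, request coalescing and serve-stale, using golang.org/x/sync/singleflight; the Store interface and load function are illustrative, not a particular library’s API.

```go
package cache

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// Store is whatever cache backend you already run (in-process, Redis, tiered RAM+NVMe, ...).
type Store interface {
	Get(ctx context.Context, key string) (value []byte, stale bool, ok bool)
	Set(ctx context.Context, key string, value []byte) error
}

type Loader struct {
	store Store
	group singleflight.Group
	load  func(ctx context.Context, key string) ([]byte, error) // e.g. the backing DB query
}

// Get coalesces concurrent misses for the same key into one origin call and
// falls back to stale data if the origin is unavailable.
func (l *Loader) Get(ctx context.Context, key string) ([]byte, error) {
	if v, stale, ok := l.store.Get(ctx, key); ok && !stale {
		return v, nil
	}
	v, err, _ := l.group.Do(key, func() (interface{}, error) {
		fresh, err := l.load(ctx, key)
		if err != nil {
			return nil, err
		}
		_ = l.store.Set(ctx, key, fresh)
		return fresh, nil
	})
	if err != nil {
		// Serve stale rather than erroring if we still have something cached.
		if sv, _, ok := l.store.Get(ctx, key); ok {
			return sv, nil
		}
		return nil, err
	}
	return v.([]byte), nil
}
```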
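
For item 16, a pool-tuning sketch with database/sql; the numbers are illustrative starting points to size against your server’s max_connections and your workload, not recommendations.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/app?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap concurrent connections well below the server's limit; a smaller,
	// busier pool often delivers more throughput than an oversized one.
	db.SetMaxOpenConns(40)
	db.SetMaxIdleConns(40)                  // keep warm connections to avoid reconnect churn
	db.SetConnMaxLifetime(30 * time.Minute) // recycle so failovers don't leave pinned dead hosts
	db.SetConnMaxIdleTime(5 * time.Minute)  // release idle capacity during quiet periods
}
```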

This list is terse, but these practices have lifted our DB reliability as a platform from 3.5x 9s to >= 99.995% and made our on-call rotation one of the lighter ones in the company.

10 May 2023

On Reliability

Today I read The Calculus of Service Availability: You’re only as available as the sum of your dependencies, and it summarizes some of the most salient wisdom on designing for reliability targets.

The takeaways are:

  1. Humans and the systems around them have enough imperfections that 4 or 5 9s of reliability is the maximum worth targeting
  2. Problems come from the service itself or its critical dependencies.
  3. Availability = MTTF/(MTTF+MTTR)
  4. Rule of the extra 9 - any service’s critical dependencies should exceed the service’s own SLA by one 9 of reliability, i.e. a 3x 9s service should only depend on 4x 9s services in its critical path.
  5. When depending on services that don’t meet that threshold, the gap must be accounted for via resilient design
  6. The math:
  • Assume a service has an error budget of 0.01%.
  • Give 0.005% of that budget to the service itself and 0.005% to its critical dependencies; with 5 dependencies, each gets 1/5 of 0.005%, or 0.001%, i.e. each must be 99.999% available.
  • When calculating aggregate availability: one service and two dependency services at 99.99% each yield 99.99% * 99.99% * 99.99% ≈ 99.97% availability. This can be adjusted by improving availability through redundancy or by removing hard dependencies.
  • Frequency * detection * recovery determines the impact of outages and the feasibility of a given SLA.
  7. Consequently, the levers for improvement are: reducing outage frequency, reducing blast radius (sharding, cellular architectures, or customer isolation), and reducing MTTR.
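
As a worked version of that math, here’s a small sketch under the article’s model (serial hard dependencies, and half the error budget split evenly across dependencies); the figures are the examples above, not measurements.

```go
package main

import "fmt"

// aggregateAvailability multiplies the availabilities of a service and its
// serial hard dependencies.
func aggregateAvailability(components ...float64) float64 {
	a := 1.0
	for _, c := range components {
		a *= c
	}
	return a
}

// perDependencyBudget splits half of the service's error budget evenly
// across its critical dependencies.
func perDependencyBudget(errorBudget float64, numDeps int) float64 {
	return (errorBudget / 2) / float64(numDeps)
}

func main() {
	// One service plus two dependencies at 99.99% each ≈ 99.97% overall.
	fmt.Printf("aggregate availability: %.4f%%\n",
		100*aggregateAvailability(0.9999, 0.9999, 0.9999))

	// A 0.01% error budget shared across 5 dependencies leaves each 0.001%,
	// i.e. each must be ~99.999% available.
	fmt.Printf("per-dependency error budget: %.4f%%\n",
		100*perDependencyBudget(0.0001, 5))
}
```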

I’ve been evolving our Online Database Platform at work, and the themes of the “rule of the extra 9” and of how to move quickly as well as safely with limited blast radius are top of mind. The tradeoff here is complexity, and a rule of thumb is that each additional 9 costs 10x the effort/cost.

We’ve made some major changes (cluster topology, upgrades in nosql and sql, automation tooling) that are moving our stack to a point where I’m proud of the accomplishments.

Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).

01 Nov 2022

Storage Platforms for the Modern Era

One of my teams at work is the online storage team (love it!), so I’m focusing my efforts on how we improve the high availability of these systems and their behavior during incidents or degradation, i.e. all the MTs: MTTR, MTBF, and MTTD.

I’ve been spending a lot of time considering reliability improvements and turning those into a series of architecture and storage design principles to follow.

For these systems we’ve learned that our greatest leverage is in MTBF, which is predicated on operating these systems in the messy real world of cloud computing with the expectation of hardware failures (gp2: a 99.8 to 99.9% SLA per volume per year).

What’s a recipe for improving current system behaviors?

  • Partition critical and non-critical workloads
  • Use read replicas as heavily as consistency requirements allow
  • Choose the right cost/reliability threshold for your workloads (gp3 vs io2)
  • Remove cross-shard queries from your critical path
  • Run storage systems with sufficient headroom that you don’t have to performance tune in a panic
  • Ensure 1-2 years of architectural runway on systems in case you need to shift to a new storage platform (ie eat your veggies first)
  • Horizontally shard at the application layer by promoting hot or large tables out to a dedicated cluster (e.g. for Aurora MySQL); see the routing sketch after this list
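
As a sketch of two of these points (leaning on read replicas, and promoting hot or large tables to dedicated clusters), here is one way to route at the application layer; the Cluster and Router types, endpoints, and table names are illustrative, not a specific library’s API.

```go
package storage

import "database/sql"

// Cluster holds separate handles for the writer endpoint and the reader
// (replica) endpoint of a single database cluster.
type Cluster struct {
	Writer *sql.DB
	Reader *sql.DB // lean on this as heavily as consistency requirements allow
}

// Router maps hot or oversized tables onto dedicated clusters, with a
// default cluster for everything else.
type Router struct {
	Default *Cluster
	ByTable map[string]*Cluster // e.g. {"events": eventsCluster}
}

// For picks the handle for a statement: the table's dedicated cluster if it
// was promoted out, otherwise the default; the reader for read-only work.
func (r *Router) For(table string, readOnly bool) *sql.DB {
	c := r.Default
	if dedicated, ok := r.ByTable[table]; ok {
		c = dedicated
	}
	if readOnly && c.Reader != nil {
		return c.Reader
	}
	return c.Writer
}
```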

With enough business success and data-intensive features, you’ll hit an inflection point where you must either adopt a new online storage technology OR invest heavily in a sharding layer on top of your existing storage (e.g. Vitess on MySQL). Based on the evidence of adoption at other companies, adopting a complex sharding solution is more expensive and less featureful than adopting a full storage platform that includes those features.

Ideal online storage system characteristics

Legend: ✅=yes, ☑️=partial, ⭕=no

Feature / Behavior Aurora MySQL 5.x MongoDB 3.x TiDB 6.x ScyllaDB 2022 FoundationDB 7.0
No downtime or errors due to node failure, maintenance or upgrades ☑️ 1 ☑️ 2 3 4
Workload priority management ☑️ 5 ☑️
Placement rules 6 7
Table partitioning ☑️
Full hardware utilization of writer/readers
Ability to transparently and safely rebalance data ☑️ 8
Linear horizontal scaling 9
Change Data Capture
Prioritize consistency or availability per workload ☑️
Good enough support for > 1 workload (k/v, sql, document store) ☑️
Low operational burden
Supports/allows hardware tiering in db/table ☑️
Safe non-downtime schema migration ☑️
OLAP workloads 10 ☑️
SQL-like syntax for portability and adoption ☑️
Licensing ☑️ 11 ☑️ 12
Source available

Legend

(Chart filled in using my significant experience with MongoDB (<= 3.x) and Aurora MySQL (5.x); knowledge of TiDB, ScyllaDB, and FoundationDB comes from architecture, documentation, code, and articles.)

Predictions for the next 10 years of online storage

  1. Tech companies will require higher availability so we’ll see a shift towards multi-writer systems with robust horizontal scalability that are inexpensive to operate on modest commodity hardware.
  2. Storage systems will converge on a foundational layer of consistent distributed K/V storage with different abstraction layers on top (TiDB/Tidis/TiKV, TigrisDB, FoundationDB) to simplify operations with a robust variety of features.

  1. Downtime during failover but good architecture for maintenance and upgrades ↩︎

  2. Downtime during failover but good architecture for maintenance and upgrades ↩︎

  3. Best ↩︎

  4. Best ↩︎

  5. In Enterprise Version ↩︎

  6. In Zones ↩︎

  7. Placement Rules ↩︎

  8. Architecture in 3.x has risky limitations and performance issues on high throughput collections ↩︎

  9. See ^8 ↩︎

  10. HTAP functionality through Raft learner nodes in TiFlash with columnar storage ↩︎

  11. SSPL terms are source available but not OSI-compliant, and carry theoretical legal risks in usage on par with AGPL, except less widely understood ↩︎

  12. AGPL ↩︎

12 Feb 2022

Hiring That First Backend Engineer

[Facts adjusted to protect the innocent]

A friend approached me to talk about what to look for when hiring their startup’s first backend engineer. The company is ~8 people, with a frontend/3d CTO and 2 strong engineers on the frontend. There is currently no backend tech or staffing, so this is hiring from the ground up.

First off, it’s hard to hire well for a skillset you lack, so get advice from people who have that expertise.

For personality, they need to be scrappy and able to do more with less. This is entirely at odds with the expectations that exist for staff in larger corporations. You want more “indie hacker” style than deep expertise on LSM Trees.

They need to be personable and able to work well with various business units, since it’s all hands on deck to survive the first couple years of a startup. They might need to be front line support for Enterprise customers and will require the patience and tact to retain that customer despite less stable technical systems.

They need to understand the build-versus-buy debate and strongly prefer to buy when staffing is short. They will do things with Zapier, Airtable, etc that you find appalling… and they’ll use that saved time to invest in mission critical systems getting the product features or reliability they need.

They know how to build reliable systems, but save their efforts for the systems that need it. Sometimes a scholastic grade of C+ is the right benchmark for Tier 2 systems and they know how to make the tradeoff.

They’re biased towards boring technologies and realize how few innovation tokens a young startup can truly spend. At this stage of my own development, this means they’re using tech like Golang, gRPC, Aurora MySQL, and hosted solutions. They realize every hour needs to be delivering business value and reinventing wheels with self-hosting or new flashy tech is a failure mode.

They see technology as a means to an end of delivering product features and “wow” for customers.

They’ll need to be the DevOps and Developer Efficiency team, on top of their main backend architect and coder role. They need the skillset and willingness to design IAM permissioning, wrangle sensible and secure defaults into AWS/GCP, and ensure developers have a minimal amount of tooling and support to be effective.

They’re the ones who will set up your continuous integration, continuous deployment, deployments and rollbacks, operational manuals, paging system, alerting, metrics, commit hooks, linting, unit testing, end-to-end testing, staging environments, etc.

They’re designing a storage layer that will likely survive for 5+ years… so while they’re not planning on ROFL scale, they have a general idea of how the current system would evolve into ROFL scale. (TLDR: please use Aurora MySQL… and if that can’t handle your ROFL scale, consider DynamoDB or MongoDB, which can rofl-scale with coaxing and mild care.)

They’ll set up your data warehousing and long-term archival (blob storage). They don’t need to be an expert here but should have seen it done before. If they can use a hosted solution for warehousing, that’s best. Otherwise, they know enough to choose among the various hosted/self-hosted solutions (Snowflake, Spanner, Redshift, ClickHouse).

They’ll work with your analysts and eventually will set them up with a dedicated Business Intelligence tool. When I sourced this in a prior company we settled on Periscope and it treated us well. They’ll make sure you run it off a read replica so you don’t endanger production.

They’ll do the rest of your backend hiring, so they should have the skill and experience for hiring, firing, and leadership. They don’t have to be a people manager but do need to be willing to act in whatever way is best for the organization.

They need to be scrappy, i.e. able and willing to tackle any problem. They should also be willing and able to work on the frontend if that’s the most critical task of the moment. If they can’t solve a problem, they know to find someone who can for advice, or to admit they’re stuck and brainstorm solutions.

They need to be product minded and think of how their expertise can unlock that product vision. They’ve either considered founding their own startup, worked in a small one or done entrepreneurial work before.

They need to understand the value of dollars in your startup; it’s different than in larger corporations with greater runway. If someone isn’t productive, it could result in the whole startup failing and putting everyone out of work… better to coach that person and fix it, or send them packing.

They need to be presentable to a wide audience: sales staff, investors, TechCrunch, other engineers.

10 Apr 2021

Startup Principles

TLDR

When I build my own company, I’ll think back to working at a bootstrapped startup and the lessons I learned from our founder.

My business will embody these principles:

  • Believe (obsessively and doggedly believe)
  • Keep It Super Simple 1 2 (radical simplicity)
  • Ship It (ship the bare minimum and get feedback, repeat)

Honorable mention:

  • do things that don’t scale 3 (By the time you have scaling limits, you can pay to figure out scale.)

When I’m building features, finding customers, or planning technical architecture, I’ll think of her and say: WWMD.

My Startup Story

Small startups are a cesspool of ideas, experiences, stress, and joy, and I learned a ton through my phases in smaller startups. Today, I want to highlight lessons learned from working at a small, bootstrapped-ish startup. I’ve also been reading Ray Dalio’s book Principles 4.

The startup was ~50 people when I left, and I was there for 3 years. I joined two years into its journey (~25 people total, with 6 in engineering). Money was always tight, and yet we were moving two digits of millions in revenue when I left.

Beyond learning from my time there, I learned from the 2 years before me… from the example set by our solo founder.

She had a year or two of professional experience in web development, focused on design and frontend, and had no business experience… until she made her own experience! She became obsessed with a business problem she encountered casually, and from then on she lived, breathed, and fought for her business.

She built a truly minimum viable product… and was a shining example of what you can do with radical simplicity. Her business started with a Google Form. She got amazing productivity and value out of off-the-shelf free tools. It turns out free tiers can be pushed REALLY FAR with the right urgency and necessity! If Airtable had existed at the time, I think it would have been our database.

She learned what she needed to know about databases, servers, and analytics, but it was all a means to the end of making the business work. She didn’t get lost in the la-la land of falling in love with the tech. It was a tool she became proficient with, but it didn’t rule her. Technology was a means to an end: building the business. My time with her honed my business approach of delivering the most value we could with the available time and energy. We radically optimized for the biggest bang for the buck, because there weren’t many bucks to go around.

What drew me to the company was her obsession, passion, and success. Her fervor and obsession were infectious, and I saw my own star rising with hers. I learned to see the world a bit more like she did, with fewer rules and boundaries. I bring that with me into my own engineering leadership and personal values in software engineering.