10 May 2023

On Reliability

I read The Calculus of Service Availability: You’re only as available as the sum of your dependencies today and it summarizes some of the most salient wisdom in designing for reliability targets.

The takeaways are:

Humans have enough imperfections in their nearby systems that 4 or 5 9s of reliability is the maximum value worth targeting
Problems come from the the service itself or its critical dependencies.
Availability = MTTF/(MTTF+MTTR)
Rule of an extra 9 - Any service should rely on critical services that exceed their own SLA by 1x 9 of reliability, ie a 3x 9s service should only depend on 4x 9s services in its critical path.
When depending on services not meeting that threshold, it must be accounted for via resilient design
The math:

Assuming a service has error budget of 0.01%.
Choose 0.005% error budget for service and the 0.005% for critical dependencies, then 5 dependencies each get 1/5 of 0.005% or 0.001%, ie must be 99.999% available.
If calculating aggregate availability: 1x service and 2x dependency services of 99.99% yield 99.99 * 99.99 * 99.99 = 99.97% availability. This can be adjusted by improving availability through redundancy or removing hard dependencies.
Frequency * Detection * Recovery = the impact of outages and the feasibility of a given SLA.

Consequently, the levers for improvement are: reduce frequency of outage, blast radius (sharding, cellular architectures, or customer isolation), and MTTR.

I’ve been evolving our Online Database Platform at work and the themes of “rule of an extra 9” and how to move quickly as well as safely with limited blast radius are top of mind. The tradeoff here is complexity and a rule of thumb that each additional 9 costs 10x the effort/cost.

We’ve made some major changes (cluster topology, upgrades in nosql and sql, automation tooling) that are moving our stack to a point where I’m proud of the accomplishments.

Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).

xargs.io

All the IO and Multiplexing

10 May 2023