These are best practices for storage design and reliability, drawn from a software engineering career of over a decade and a deep focus on this topic over the last 4 years.
- Write throughput is harder to scale than read throughput
- Improve reliability and simplify operational changes by segregating your data by tier (criticality) or scale. Implementing this pushed our DB reliability from ~3.5 9s to ~4.5 9s
- Start with MySQL (recommended) or PostgreSQL (acceptable) for most workloads
- Use hosted offerings (AWS Aurora MySQL) until the downsides become overwhelming (typically N years in)
- Hosted databases that advertise 99.99% availability are misleading: that figure excludes upgrades, config changes, vertically scaling the writer, etc. Do just one of those per month on a service backed by a single DB and you're dangerously close to breaching 4 9s (99.99% is only ~4.3 minutes of downtime budget per month). So plan on them realistically offering no better than ~3.5 9s.
- If you want a genuine 99.99% or 99.999%, use DynamoDB (or DynamoDB global tables), because unlike AWS Aurora, Dynamo is a fully managed service without those drawbacks
- Control all schema changes as part of your SDLC (e.g. commit your migration files and have a CI job plus goose or Flyway execute them; see the migration sketch below)
- Enforce your schemas in the database (RDBMS), or for a schemaless store (Mongo) in a single domain-modeling service that is the only client of the database and enforces the schema at the serialization/deserialization boundary (see the boundary-validation sketch below), or try Mongo's newer in-database schema validation. Doing otherwise will lead to higher defect rates.
- Emit and record DB driver metrics (client-side) and lower-level DB metrics with something like PMM (server-side); see the pool-stats sketch below
- Be proficient at forcing failovers; you'll need to do it occasionally with an RDBMS or Mongo. Make it a CLI or automated process driven by gray-failure signals and some pre-checks (see the failover sketch below).
- Controversial opinion: prefer MySQL over PostgreSQL for more advanced usage and its operational advantages (see Uber's article on switching)
- Sources of reliability risk: schema changes, cluster saturation, hardware failures, query plan changes. Each has mitigations that lessen its frequency or impact.
- Ideal caching primitive characteristics: request coalescing (singleflight), serve-stale, dynamic scaling, tiered storage across RAM + NVMe, avoidance of the cold-cache problem, built-in sharding, and a TTL on every record to avoid zombie records (see the caching sketch below)
- Online Storage Platform Triad: RDBMS, NoSQL, K/V store/cache
- One day you'll need sharding, but in 2024 use established architectures and avoid reinventing the wheel. You can scale vertically, then partition vertically, to forestall it. Read about Spanner's origins before you decide to build your own sharding layer, and consider how you'll reshard, avoid hotspots, pick shard keys for every table, avoid fan-out queries, and store shard metadata (see the shard-routing sketch below)
- On day 1 you'll need connection pooling; get proficient at tuning it. Fewer connections can mean more throughput (see the pool-tuning sketch below).
- One day you may want ProxySQL to handle connections and traffic routing
- Alerting - use low-pri alerts for things you'll handle during current or next business hours. Use high-pri alerts to get an operator online 24/7 to intervene within 5 minutes of the signal being emitted.
- Alerts must link to actionable runbook steps. Graduate to CLI commands or dynamic runbooks, then to fully automated responses.
- Scale as far as you can vertically and if possible use caching. Then start scaling horizontally.
- Metrics in aggregate will lie to you, especially AWS dashboards defaulting to 5-minute averages. You often want 1-minute resolution with maximum or minimum statistics to catch micro latency spikes on EBS.
- Use EBS for operational simplicity and durability; use local NVMe for performance and better cost
- Plan your Business Continuity / Disaster Recovery and know your SLA/SLO/SLIs for the systems your team/division/platform runs.
- Databases are some of your top reliability risks, especially when immature. Use reader instances, and if you need availability, be willing to give up consistency.
- Replatforming takes months to years, so know your scaling runway (disk/architecture/throughput) and plan for a 2-5 year horizon of predictable growth
- If everything’s on fire, get your systems ‘stable enough’ to buy time for making bigger investments
- Any data that's 10x to 100x the size of its peer data should live in a separate cluster or partition of a cluster. Especially when getting into tens to hundreds of TB.
This is a very terse list of best practices that have lifted our platform's DB reliability from ~3.5 9s to >= 99.995% and made our on-call rotation one of the lighter ones in the company. The sketches below expand a few of the items above with minimal code.
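The migration sketch referenced above: a minimal example of applying committed goose migrations as a CI or deploy step, assuming the pressly/goose library, a `./migrations` directory of SQL files, a MySQL target, and a `MIGRATION_DSN` environment variable (all illustrative choices, not the only way to wire this up).

```go
// Minimal sketch: apply committed goose migrations as part of CI/deploy.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql"
	"github.com/pressly/goose/v3"
)

func main() {
	db, err := sql.Open("mysql", os.Getenv("MIGRATION_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := goose.SetDialect("mysql"); err != nil {
		log.Fatal(err)
	}
	// Apply every not-yet-applied migration committed under ./migrations.
	// goose tracks applied versions in its own table, so reruns are safe.
	if err := goose.Up(db, "./migrations"); err != nil {
		log.Fatalf("migrations failed: %v", err)
	}
	log.Println("schema is up to date")
}
```

Running this as a dedicated pipeline step means schema changes are reviewed, versioned, and applied the same way in every environment.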
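The boundary-validation sketch referenced above: one domain-modeling service owns all writes and rejects anything that fails schema checks before it is serialized. The `User` shape and the `Store` interface are illustrative stand-ins, not a real driver API.

```go
// Minimal sketch: enforce the schema at the serialization boundary of the
// single service that writes to a schemaless store.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

type User struct {
	ID        string    `json:"id"`
	Email     string    `json:"email"`
	CreatedAt time.Time `json:"created_at"`
}

// validate is the single choke point: nothing reaches the store without passing it.
func (u User) validate() error {
	if u.ID == "" {
		return errors.New("id is required")
	}
	if !strings.Contains(u.Email, "@") {
		return errors.New("email is malformed")
	}
	if u.CreatedAt.IsZero() {
		return errors.New("created_at is required")
	}
	return nil
}

// Store is a stand-in for your NoSQL client (e.g. a Mongo collection).
type Store interface {
	Insert(collection string, doc []byte) error
}

func SaveUser(s Store, u User) error {
	if err := u.validate(); err != nil {
		return fmt.Errorf("rejecting write: %w", err)
	}
	doc, err := json.Marshal(u) // serialize only after the schema check passes
	if err != nil {
		return err
	}
	return s.Insert("users", doc)
}

func main() {
	// Example: this write is rejected before it can reach the store.
	err := SaveUser(nil, User{ID: "u1", Email: "not-an-email"})
	fmt.Println(err)
}
```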
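The pool-stats sketch referenced above: emitting client-side driver metrics straight from Go's `database/sql` pool. `reportGauge` is a stand-in for whatever metrics client you use, and the DSN is a placeholder.

```go
// Minimal sketch: periodically publish database/sql pool stats as gauges.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// reportGauge is a stand-in for your metrics client (statsd, Prometheus, etc.).
func reportGauge(name string, value float64) {
	log.Printf("gauge %s=%.0f", name, value)
}

// emitDBStats publishes pool stats on a fixed interval.
func emitDBStats(db *sql.DB, every time.Duration) {
	for range time.Tick(every) {
		s := db.Stats()
		reportGauge("db.pool.open_connections", float64(s.OpenConnections))
		reportGauge("db.pool.in_use", float64(s.InUse))
		reportGauge("db.pool.idle", float64(s.Idle))
		reportGauge("db.pool.wait_count", float64(s.WaitCount)) // callers that blocked waiting for a conn
		reportGauge("db.pool.wait_ms", float64(s.WaitDuration.Milliseconds()))
	}
}

func main() {
	db, err := sql.Open("mysql", "app:secret@tcp(db.internal:3306)/orders")
	if err != nil {
		log.Fatal(err)
	}
	emitDBStats(db, 15*time.Second)
}
```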
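The failover sketch referenced above: the shape of a failover helper that refuses to act unless pre-checks pass. `checkReplicaHealthy`, `checkReplicationLag`, and `triggerFailover` are hypothetical stand-ins for your driver or cloud API calls (for Aurora, the underlying call would be the RDS FailoverDBCluster action).

```go
// Minimal sketch: a failover CLI that runs pre-checks before promoting a replica.
package main

import (
	"errors"
	"flag"
	"fmt"
	"log"
	"time"
)

const maxLag = 2 * time.Second

func checkReplicaHealthy(replica string) error {
	// stand-in: ping the candidate, verify read_only and a recent heartbeat
	return nil
}

func checkReplicationLag(replica string) (time.Duration, error) {
	// stand-in: e.g. read replica lag from the server or your cloud metrics
	return 0, nil
}

func triggerFailover(cluster, replica string) error {
	// stand-in: call your cluster manager or cloud API here
	return errors.New("not implemented")
}

func main() {
	cluster := flag.String("cluster", "", "cluster identifier")
	replica := flag.String("promote", "", "replica to promote")
	flag.Parse()

	if err := checkReplicaHealthy(*replica); err != nil {
		log.Fatalf("pre-check failed, aborting: %v", err)
	}
	lag, err := checkReplicationLag(*replica)
	if err != nil || lag > maxLag {
		log.Fatalf("replication lag check failed (lag=%v), aborting: %v", lag, err)
	}

	fmt.Printf("pre-checks passed, failing over %s to %s\n", *cluster, *replica)
	if err := triggerFailover(*cluster, *replica); err != nil {
		log.Fatalf("failover failed: %v", err)
	}
}
```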
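The caching sketch referenced above: an in-memory read-through cache combining request coalescing via `golang.org/x/sync/singleflight`, a freshness window, serve-stale with background refresh, and a hard TTL. Tiered RAM + NVMe storage and sharding are out of scope here; the `fresh`/`ttl` split is a policy assumption, not a library feature.

```go
// Minimal sketch: singleflight + serve-stale + hard TTL, in memory only.
package main

import (
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type entry struct {
	value     []byte
	fetchedAt time.Time
}

type Cache struct {
	mu    sync.RWMutex
	data  map[string]entry
	group singleflight.Group
	fresh time.Duration // serve directly while younger than this
	ttl   time.Duration // never serve past this; avoids zombie records
	load  func(key string) ([]byte, error)
}

func New(fresh, ttl time.Duration, load func(string) ([]byte, error)) *Cache {
	return &Cache{data: make(map[string]entry), fresh: fresh, ttl: ttl, load: load}
}

func (c *Cache) Get(key string) ([]byte, error) {
	c.mu.RLock()
	e, ok := c.data[key]
	c.mu.RUnlock()

	age := time.Since(e.fetchedAt)
	if ok && age < c.fresh {
		return e.value, nil // fresh hit
	}
	if ok && age < c.ttl {
		// Stale but within TTL: serve it and kick off one coalesced refresh.
		go c.group.Do(key, func() (interface{}, error) { return c.refresh(key) })
		return e.value, nil
	}
	// Miss or expired: coalesce concurrent loads for the same key.
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		return c.refresh(key)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}

func (c *Cache) refresh(key string) ([]byte, error) {
	val, err := c.load(key)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.data[key] = entry{value: val, fetchedAt: time.Now()}
	c.mu.Unlock()
	return val, nil
}

func main() {
	c := New(5*time.Second, 5*time.Minute, func(key string) ([]byte, error) {
		return []byte("value for " + key), nil // stand-in for a DB read
	})
	v, _ := c.Get("user:42")
	_ = v
}
```

Serving stale while a single coalesced refresh runs in the background is what keeps a hot key from dogpiling the database the moment its entry goes stale.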
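The shard-routing sketch referenced above: hash a shard key into a fixed set of logical shards, then map logical shards to physical clusters through metadata, so resharding becomes a metadata move rather than a re-hash of every row. The shard count and cluster names are illustrative assumptions.

```go
// Minimal sketch: deterministic shard routing with logical -> physical indirection.
package main

import (
	"fmt"
	"hash/fnv"
)

const logicalShards = 4096

// shardOf maps a shard key (e.g. customer_id) to a stable logical shard.
func shardOf(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % logicalShards
}

// shardToCluster is the metadata table you move entries in when resharding.
var shardToCluster = map[uint32]string{
	// explicit logical-shard overrides go here; default mapping below
}

func clusterFor(key string) string {
	s := shardOf(key)
	if c, ok := shardToCluster[s]; ok {
		return c
	}
	// Default: split logical shards evenly across two physical clusters.
	if s < logicalShards/2 {
		return "mysql-cluster-a"
	}
	return "mysql-cluster-b"
}

func main() {
	fmt.Println(clusterFor("customer:42"))
}
```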
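The pool-tuning sketch referenced above: the `database/sql` knobs worth learning. The specific limits and DSN are placeholders, not recommendations; tune them against your server's max_connections and the wait counts from the pool-stats sketch.

```go
// Minimal sketch: tuning database/sql pool limits for a MySQL-backed service.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "app:secret@tcp(db.internal:3306)/orders?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap concurrent connections well below the server's max_connections;
	// a smaller pool often yields more throughput than an unbounded one.
	db.SetMaxOpenConns(40)
	db.SetMaxIdleConns(40)                  // keep idle conns warm to avoid reconnect churn
	db.SetConnMaxLifetime(5 * time.Minute)  // recycle conns so failovers and topology changes propagate
	db.SetConnMaxIdleTime(90 * time.Second) // trim idle conns during quiet periods

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}
```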