One of my teams at work is the online storage team (love it!), so I’m focusing my efforts on how we improve the high availability of these systems and how they behave during incidents or degradation, i.e. all the MTs: MTTR, MTBF and MTTD.
I’ve been spending a lot of time considering reliability improvements and turning those into a series of architecture and storage design principles to follow.
For these systems we’ve learned that our greatest leverage is in MTBF, which is predicated on operating these systems in the messy real world of cloud computing with the expectation of hardware failures (gp2’s rating is roughly 99.8 to 99.9% per volume per year).
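To make that framing concrete, here is a minimal back-of-the-envelope sketch of what a 99.8–99.9% per-volume annual rating implies at fleet scale. The fleet sizes are hypothetical placeholders, not figures from our systems:

```python
# Rough expected-failure math for a fleet of gp2-class volumes.
# The annual failure rates come from the published 99.8-99.9% per-volume
# figure; the fleet sizes are made up for illustration.

fleet_sizes = [100, 1_000, 10_000]
annual_failure_rates = [0.001, 0.002]  # 99.9% and 99.8% per volume per year

for n in fleet_sizes:
    for afr in annual_failure_rates:
        expected_failures = n * afr
        print(f"{n:>6} volumes @ {afr:.1%} AFR -> ~{expected_failures:.1f} failures/year")
```

At a few thousand volumes that is already several expected failures per year, which is why the recipe below treats volume and node loss as routine rather than exceptional.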
What’s a recipe for improving current system behaviors?
- Partition critical and non-critical workloads
- Use read replicas as heavily as consistency requirements allow
- Choose the right cost/reliability threshold for your workloads (gp3 vs io2)
- Remove cross-shard queries from your critical path
- Run storage systems with sufficient headroom that you don’t have to performance tune in a panic
- Ensure 1-2 years of architectural runway on systems in case you need to shift to a new storage platform (i.e. eat your veggies first)
- Horizontally shard at the application layer by promoting hot or large tables out to a dedicated cluster (e.g. for Aurora MySQL); a routing sketch follows this list
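As a rough illustration of the read-replica and hot-table points above, here is a minimal sketch of application-layer routing. The cluster names, endpoints, table names, and the `endpoint_for` helper are all hypothetical, not our actual setup:

```python
# Minimal sketch of application-layer routing: reads that tolerate replica lag
# go to read replicas, and a hot table ("events") has been promoted out of the
# core cluster to its own dedicated cluster. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Cluster:
    writer: str          # writer endpoint
    readers: list[str]   # read-replica endpoints

CLUSTERS = {
    "core":   Cluster("core-writer.db:3306",   ["core-ro-1.db:3306", "core-ro-2.db:3306"]),
    "events": Cluster("events-writer.db:3306", ["events-ro-1.db:3306"]),
}

# Hot or large tables promoted out of the core cluster.
TABLE_TO_CLUSTER = {"events": "events"}

def endpoint_for(table: str, *, read_only: bool, allow_replica_lag: bool) -> str:
    """Pick a database endpoint for this table and access pattern."""
    cluster = CLUSTERS[TABLE_TO_CLUSTER.get(table, "core")]
    if read_only and allow_replica_lag and cluster.readers:
        # A real router would load-balance; take the first replica for brevity.
        return cluster.readers[0]
    return cluster.writer

# Example: an analytics read on the hot table tolerates replica lag,
# so it lands on the dedicated events read replica.
print(endpoint_for("events", read_only=True, allow_replica_lag=True))
```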
With enough business success and data-intensive features, you’ll hit an inflection point where you must either adopt a new online storage technology OR invest heavily in a sharding layer on top of your existing storage (e.g. Vitess on MySQL). Based on the evidence of adoption at other companies, adopting a complex sharding solution is more expensive and less featureful than adopting a full storage platform that includes those features.
Ideal online storage system characteristics
Legend: ✅=yes, ☑️=partial, ⭕=no
Feature / Behavior | Aurora MySQL 5.x | MongoDB 3.x | TiDB 6.x | ScyllaDB 2022 | FoundationDB 7.0 |
---|---|---|---|---|---|
No downtime or errors due to node failure, maintenance or upgrades | ⭕ | ☑️ 1 | ☑️ 2 | ✅ 3 | ✅ 4 |
Workload priority management | ⭕ | ⭕ | ☑️ | ✅ 5 | ☑️ |
Placement rules | ⭕ | ✅ 6 | ✅ 7 | ⭕ | ⭕ |
Table partitioning | ⭕ | ☑️ | ✅ | ⭕ | ✅ |
Full hardware utilization of writer/readers | ⭕ | ⭕ | ✅ | ✅ | ✅ |
Ability to transparently and safely rebalance data | ⭕ | ☑️ 8 | ✅ | ✅ | ✅ |
Linear horizontal scaling | ⭕ | ✅ 9 | ✅ | ✅ | ✅ |
Change Data Capture | ✅ | ✅ | ✅ | ✅ | ✅ |
Prioritize consistency or availability per workload | ⭕ | ⭕ | ⭕ | ✅ | ☑️ |
Good enough support for > 1 workload (k/v, sql, document store) | ⭕ | ⭕ | ✅ | ☑️ | ✅ |
Low operational burden | ✅ | ✅ | ✅ | ✅ | ✅ |
Supports/allows hardware tiering in db/table | ⭕ | ☑️ | ✅ | ⭕ | ⭕ |
Safe non-downtime schema migration | ⭕ | ☑️ | ✅ | ✅ | ✅ |
OLAP workloads | ⭕ | ⭕ | ✅ 10 | ☑️ | ⭕ |
SQL-like syntax for portability and adoption | ✅ | ⭕ | ✅ | ☑️ | ⭕ |
Licensing | ✅ | ☑️ 11 | ✅ | ☑️ 12 | ✅ |
Source available | ⭕ | ✅ | ✅ | ✅ | ✅ |
(Chart filled in using my significant experience with MongoDB (<= 3.x) and Aurora MySQL (5.x); my knowledge of TiDB, ScyllaDB and FoundationDB comes from architecture, documentation, code, and articles.)
Predictions for the next 10 years of online storage
- Tech companies will require higher availability, so we’ll see a shift towards multi-writer systems with robust horizontal scalability that are inexpensive to operate on modest commodity hardware.
- Storage systems will converge on a foundational layer of consistent distributed K/V storage with different abstraction layers on top (TiDB/Tidis/TiKV, TigrisDB, FoundationDB) to simplify operations with a robust variety of features.
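To illustrate what an abstraction layer over a consistent distributed K/V foundation looks like, here is a minimal sketch of mapping table rows onto an ordered key/value store, loosely in the spirit of TiDB over TiKV or record layers over FoundationDB. The key format and in-memory store are simplifications for illustration, not any system’s actual encoding:

```python
# Minimal sketch of a table abstraction layered over an ordered K/V store.
# The key encoding ("t/<table>/r/<primary key>") is an illustrative
# simplification, not TiDB's or FoundationDB's actual format.

import json

class OrderedKV:
    """Stand-in for a distributed, sorted K/V store (TiKV, FoundationDB, ...)."""
    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def scan(self, prefix: bytes):
        for k in sorted(self._data):
            if k.startswith(prefix):
                yield k, self._data[k]

class TableLayer:
    """Row-oriented abstraction built entirely from K/V puts and prefix scans."""
    def __init__(self, kv: OrderedKV, table: str):
        self.kv, self.table = kv, table

    def insert(self, pk: str, row: dict) -> None:
        key = f"t/{self.table}/r/{pk}".encode()
        self.kv.put(key, json.dumps(row).encode())

    def full_scan(self):
        prefix = f"t/{self.table}/r/".encode()
        return [json.loads(v) for _, v in self.kv.scan(prefix)]

kv = OrderedKV()
users = TableLayer(kv, "users")
users.insert("42", {"id": 42, "name": "Ada"})
print(users.full_scan())
```

The appeal of this layering is operational: one well-understood replicated K/V substrate underneath, with SQL, document, or key/value APIs built on top.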
1. Downtime during failover, but good architecture for maintenance and upgrades
2. Downtime during failover, but good architecture for maintenance and upgrades
3. Best
4. Best
5. In Enterprise Version
6. In Zones
7. Placement Rules
8. Architecture in 3.x has risky limitations and performance issues on high-throughput collections
9. See note 8
10. HTAP functionality through Raft learner nodes on TiFlash with columnar storage
11. SSPL terms are source-available but not OSI-compliant, and carry theoretical legal risks in usage on par with AGPL, except less widely understood
12. AGPL