At work we adopted TiDB for multi-tenant workloads, and we leaned in hard as early adopters of its Resource Control feature.
We run as many as 100 services in a single cluster with a variety of workload types, and our cluster deployment strategy is designed around workload maturity and size tiering.
We knew the only cost-effective way to run these clusters was with a heavily multi-tenanted design. We worked out our optimal design for how to use resource control and resource units in TiDB to provide fair usage of the clusters and avoid noisy-neighbor problems through workload decorrelation.1
Here is everything we learned in two and a half years of operating Tier 0 through Tier 2 clusters at a 99.99%+ availability and latency SLA.
Background
Load-based resource units are a concept coming out of top-tier online storage systems such as PingCAP's TiDB (Resource Units)2, Tencent X-Stor (Request Units)3, and ByteDance Abase (Request Units)4. DynamoDB uses a similar costing metric of Read and Write Request Units. These units are a composite measurement of load in the backend cluster via direct and indirect signals of CPU, memory, disk, and network usage.
A common design pattern in RUs is that writes cost a multiple of reads, both in the per-request charge and in the per-byte payload charge. This reflects the internal cost of write replication for durability and the write amplification of indexing.
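As a sketch of how such a composite cost might be computed, here is a toy RU model; the coefficients are invented for illustration and are not any system's actual tariff:

```python
# Toy composite RU cost model. The per-request and per-KiB coefficients
# below are made up for the sketch; real systems derive them from
# benchmarked backend cost.
def request_units(kind: str, payload_bytes: int) -> float:
    # Writes carry a multiplier over reads on both charges, reflecting
    # replication for durability and index write amplification.
    per_request = {"read": 1.0, "write": 3.0}
    per_kib = {"read": 0.5, "write": 1.5}
    kib = payload_bytes / 1024
    return per_request[kind] + per_kib[kind] * kib

# A 4 KiB write costs three times a 4 KiB read under these coefficients:
request_units("write", 4096)  # 9.0
request_units("read", 4096)   # 3.0
```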
When using resource units, the goal is to provide as much capacity to individual services as possible while maintaining stability at a per-service level and at a cluster level.
Design
Our best practices in a heavily microserviced environment are as follows:
- Every service gets one main application user and may add additional users
- Every application user gets a dedicated Resource Group
- Every Resource Group is assigned an allowance of RU/s
- Every Resource Group is assigned a priority
  - Application users = HIGH
  - Schema migration user = LOW
  - Extra users receive an appropriate priority level
  - Note: fixes landing in TiDB v8.5.x should let us also use MEDIUM; currently it misbehaves when the cluster is under high CPU load.
- No Resource Group is allowed BURSTABLE=true, because burstable groups do not share spare resources in proportion to each group's RU share of the whole.
- Every Resource Group gets a Runaway Query guard (at least once patches land later in v8.5.x) to deprioritize queries running longer than N seconds
  - Eventually we will adopt Runaway Query rules based on keys scanned or RU consumed.
- RU/s are allocated for the peak tolerable throughput needs of each service, under the multi-tenant assumption that traffic spikes across services are not strictly correlated.
  - E.g., cluster capacity is over-provisioned like airline seats or insurance policies, assuming that not everyone will use their full allowance at once.
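The over-provisioning assumption above can be sketched numerically (service names, allowances, and capacity figures are all invented for illustration):

```python
# Sketch of the over-provisioning check: the sum of per-group RU/s
# allowances may exceed cluster capacity as long as peaks are not
# fully correlated. All numbers are hypothetical.
allowances = {"orders": 4000, "search": 6000, "billing": 2000, "audit": 1000}
cluster_ru_per_sec = 10_000

total_allowance = sum(allowances.values())               # 13000
oversubscription = total_allowance / cluster_ru_per_sec  # 1.3x oversold

# The arrangement is safe while the observed *concurrent* peak across
# all services stays under the cluster's real capacity:
observed_concurrent_peak = 8_500
assert observed_concurrent_peak < cluster_ru_per_sec
```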
The assignment of these values is automated through a wrapper layer on top of our fork of terraform-provider-mysql.
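A minimal sketch of the kind of DDL such a wrapper renders per service; the `render_resource_group` helper, group names, and RU numbers are hypothetical, but the statements themselves are standard TiDB resource-control DDL:

```python
# Hypothetical rendering step of the automation wrapper: one Resource
# Group per application user, pinned to an RU/s allowance and priority.
def render_resource_group(service: str, ru_per_sec: int,
                          priority: str = "HIGH") -> list[str]:
    group = f"rg_{service}"
    return [
        # Create (or keep) the group with its allowance and priority.
        f"CREATE RESOURCE GROUP IF NOT EXISTS {group} "
        f"RU_PER_SEC = {ru_per_sec} PRIORITY = {priority};",
        # Bind the service's application user to the group.
        f"ALTER USER '{service}_app'@'%' RESOURCE GROUP {group};",
    ]

stmts = render_resource_group("orders", 4000)
```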
Cluster capacity estimates of "RU/s per cluster" are determined from real-world usage, starting with very high safety factors and then refined over time depending on workload predictability.
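The refinement loop can be sketched as a shrinking safety factor applied to the observed peak (the factors and RU figures are illustrative, not our actual values):

```python
# Allowance = observed peak RU/s times a safety factor that shrinks as
# the workload's variability becomes better understood over time.
def allowance(observed_peak_ru_s: float, safety_factor: float) -> int:
    return int(observed_peak_ru_s * safety_factor)

allowance(1800, 3.0)   # new, unpredictable workload: 5400 RU/s
allowance(1800, 1.5)   # after months of stable data:  2700 RU/s
```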
By tracking the RU consumption of individual services over time, we can provide chargebacks to the internal teams to improve their product-level margin calculations and influence efficient infrastructure usage.
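A toy chargeback rollup, assuming invented usage numbers and a cost split proportional to each team's share of RU consumption:

```python
# Convert each team's monthly RU consumption into a share of cluster
# cost. All usage and cost figures here are invented for the sketch.
monthly_ru = {"orders": 9.0e9, "search": 6.0e9, "billing": 3.0e9}
cluster_monthly_cost = 18_000  # illustrative cost, in arbitrary currency

total = sum(monthly_ru.values())
chargeback = {team: cluster_monthly_cost * ru / total
              for team, ru in monthly_ru.items()}
# "orders" consumed half the RUs, so it carries half the bill: 9000.0
```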
Future Improvements
I’m excited about the improvements landing in the v8.5.x series for priority-level queuing and for runaway queries. I also hope runaway-query handling based on RU consumption is extended to cover write operations as well as reads.
Conclusion
Resource Units are a powerful and underutilized capability in our industry and datastore niche. I’m looking forward to a future project where I can bring them to more datastore stacks!