Part 1 of The Perfect Proxy
Good fences make good neighbors but MAX_EXECUTION_TIME reduces noisy neighbors - DBA
Intro
Most database reliability strategies rely on “good enough” fences: connection limits, query timeouts, and static rate limiting. But at high scale, with availability targets of 99.995% and above, “good enough” is a liability. While we dream of infinite auto-scaling, the reality is that database clusters are heavy, slow to move, and expensive to over-provision. To achieve true resilience, we don’t need more hardware; we need better admission control. This post argues that fairness algorithms and load-based rate limiting, implemented at a database-agnostic proxy layer, are the missing step function for modern infrastructure.
The Failure of Static Limits
Traditional safeguards are woefully insufficient because they measure the wrong things.
- Requests-per-Second (RPS) Limits treat a 1ms primary key lookup the same as a 10-second table scan.
- Connection Counts protect memory but not CPU.
- Query Timeouts kill the request too late, after the work is already done.
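To make the first bullet concrete, here is a minimal sketch of how a resource-based cost could tell those two queries apart. The `QueryStats` fields and the three weights are assumptions made up for illustration; a real deployment would calibrate them per backend.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    cpu_ms: float        # CPU time spent by the backend
    rows_scanned: int    # rows or documents examined
    bytes_returned: int  # payload sent back to the client


# Hypothetical weights, chosen only for this example.
RU_PER_CPU_MS = 1.0
RU_PER_ROW_SCANNED = 0.01
RU_PER_KB_RETURNED = 0.1


def resource_units(stats: QueryStats) -> float:
    """Collapse several resource dimensions into one comparable cost."""
    return (
        stats.cpu_ms * RU_PER_CPU_MS
        + stats.rows_scanned * RU_PER_ROW_SCANNED
        + stats.bytes_returned / 1024 * RU_PER_KB_RETURNED
    )


# An RPS limiter counts each of these as exactly one request.
lookup = QueryStats(cpu_ms=1, rows_scanned=1, bytes_returned=1024)
scan = QueryStats(cpu_ms=10_000, rows_scanned=5_000_000, bytes_returned=1024)

print(resource_units(lookup), resource_units(scan))
```

Under these made-up weights the table scan costs tens of thousands of times more than the lookup, which is the gap a per-request counter cannot see.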
In a multi-tenant environment, these coarse controls allow “noisy neighbors” to saturate shared resources before your monitoring even alerts. Internal tools at large companies solve this with adaptive, resource-aware throttling, but these solutions remain locked behind proprietary doors. Examples in this problem domain are X-Stor [1] and Abase [2], which are designed for hard multi-tenant usage and control per-tenant load in the cluster with a fairness algorithm. Similarly, Airbnb and Uber have published work on their in-house proxy solutions for resource-usage-based rate limiting and quality-of-service enforcement [3][4]. An open-source example of load-based rate limiting also exists in TiDB and informed these conclusions.
We need that same level of sophisticated control, with Fairness Algorithms and Resource Unit (RU) tracking, available as a generic primitive for any datastore.
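As one example of what a fairness algorithm can mean here, the sketch below implements max-min fairness (often called water-filling) over an RU budget. It is a generic textbook technique, not the method used by any of the systems cited above; the function name and the tenant names are assumptions for illustration.

```python
def max_min_fair_share(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Split `capacity` RUs so that small demands are fully met and the
    remainder is shared evenly among tenants that want more."""
    allocation: dict[str, float] = {}
    remaining = capacity
    # Visit tenants from the smallest demand to the largest.
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair_share = remaining / len(pending)
        tenant, demand = pending.pop(0)
        granted = min(demand, fair_share)
        allocation[tenant] = granted
        remaining -= granted
    return allocation


shares = max_min_fair_share(
    capacity=1000.0,
    demands={"batch-export": 5000.0, "checkout": 100.0, "search": 300.0},
)
print(shares)
```

The greedy tenant is capped at whatever is left after the modest tenants are served in full, so one runaway workload cannot starve the others.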
Business Justification
We propose a unified design for mproxy (short for multi-proxy) that modernizes legacy databases for reliability, cost efficiency, and developer velocity.
By designing a proxy separate from the database engine, we target three core business objectives:
1. Reliability
The primary goal is to eliminate cascading failures and the “noisy neighbor” effect inherent in clusters that serve a variety of workloads. This is most valuable in multi-tenant clusters, but it also helps in single-tenant scenarios that mix critical and less critical traffic.
- Failure Isolation: By utilizing Resource Unit (RU) based throttling, we reduce the “blast radius” of a single tenant’s runaway queries. This mitigates the risk of site-wide outages that typically result in significant revenue loss and SLA penalties.
- Automated Circuit Breaking: The proxy acts as a safety valve, automatically draining request buckets during backend latency spikes. This prevents the “thundering herd” and ensures that transient issues do not escalate into permanent database degradation.
- Performance: Resource isolation per tenant, and across a single tenant’s queries, enables pushing the boundary on both p999 latency and total QPS.
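The first two bullets can be combined in one small sketch: a token bucket that is charged in RUs and drains itself while backend latency looks unhealthy. Everything here is an assumption for illustration (the class name, the EWMA smoothing, the degraded refill rate, and the injectable `clock` used to keep the example deterministic); it is not the prototype’s actual design.

```python
import time


class LatencyAwareBucket:
    """RU token bucket that sheds load while the backend looks unhealthy."""

    def __init__(self, ru_per_sec, degraded_ru_per_sec, burst_ru,
                 latency_limit_ms, clock=time.monotonic):
        self.ru_per_sec = ru_per_sec
        self.degraded_ru_per_sec = degraded_ru_per_sec
        self.burst_ru = burst_ru
        self.latency_limit_ms = latency_limit_ms
        self.clock = clock
        self.tokens = burst_ru
        self.last_refill = clock()
        self.ewma_ms = 0.0  # smoothed backend latency

    def healthy(self) -> bool:
        return self.ewma_ms <= self.latency_limit_ms

    def observe_latency(self, latency_ms: float, alpha: float = 0.2) -> None:
        """Feed in a backend latency sample; drain the bucket on a spike."""
        self.ewma_ms = alpha * latency_ms + (1 - alpha) * self.ewma_ms
        if not self.healthy():
            self.tokens = 0.0
            self.last_refill = self.clock()

    def try_admit(self, cost_ru: float) -> bool:
        """Admit the request only if the tenant can pay its RU cost."""
        now = self.clock()
        rate = self.ru_per_sec if self.healthy() else self.degraded_ru_per_sec
        elapsed = now - self.last_refill
        self.tokens = min(self.burst_ru, self.tokens + elapsed * rate)
        self.last_refill = now
        if self.tokens >= cost_ru:
            self.tokens -= cost_ru
            return True
        return False
```

In this sketch each tenant would get its own bucket, so a latency spike trickles traffic through at the degraded rate instead of letting queued requests hit the backend all at once when it recovers.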
2. Cost Efficiency
mproxy transforms how we manage and pay for infrastructure by providing granular visibility into every query’s relative cost.
- Granular Cost Attribution: Transitioning from simple RPS to Resource Units (RUs) allows for precise internal accounting. This data enables usage-based billing and allows the business to identify and manage tenant economics.
- Increased Hardware Density: Sophisticated load balancing, such as power of two choices (P2C), and connection management allow for higher tenant density per cluster. This “bin-packing” approach can reduce overall cloud infrastructure spend without compromising p99 latency for premium tiers.
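For readers unfamiliar with P2C, the idea fits in a few lines: sample two backends at random and send the request to the less loaded one. The sketch below uses in-flight request count as the load signal, which is an assumption for illustration; the `Backend` class and `pick_p2c` function are hypothetical names.

```python
import random


class Backend:
    def __init__(self, name: str):
        self.name = name
        self.in_flight = 0  # requests currently outstanding on this node


def pick_p2c(backends, rng=random):
    """Power of two choices: compare two random candidates, keep the idler."""
    first, second = rng.sample(backends, 2)
    return first if first.in_flight <= second.in_flight else second


# Usage: route 1,000 requests that never complete, then inspect the spread.
pool = [Backend(f"db-{i}") for i in range(5)]
for _ in range(1000):
    pick_p2c(pool).in_flight += 1
print({backend.name: backend.in_flight for backend in pool})
```

Comparing only two candidates avoids scanning every backend per request while still steering traffic away from hot nodes.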
3. Developer Velocity
The proxy acts as a programmable shim that abstracts away the “scary” parts of the database, letting engineers ship faster.
- Self-Service Guardrails: We eliminate the “fear of shipping” by implementing RU-based rate limits and Runaway Query Management at the proxy layer. Product teams can iterate on complex queries knowing the proxy will sandbox sub-optimal code before it melts the cluster, reducing the need for exhaustive manual reviews.
- Immediate Debugging Loops: Built-in Fingerprinting and OpenTelemetry provide an instant, per-query view of resource consumption. This shortens the feedback loop from “Why is the site slow?” to “This specific query is burning 10k RUs,” enabling targeted optimization instead of guesswork.
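Fingerprinting here means collapsing queries that differ only in their literal values into one key, so RU usage can be attributed to a query shape. The sketch below is a deliberately naive, regex-based version for SQL text; the normalization rules and the `record` and `top_offenders` helpers are assumptions for illustration, and a production version would more likely work from the parser’s output.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(sql: str) -> str:
    """Return a short, stable key for the shape of a SQL statement."""
    normalized = sql.strip().lower()
    normalized = re.sub(r"'(?:[^']|'')*'", "?", normalized)      # string literals
    normalized = re.sub(r"\b\d+(?:\.\d+)?\b", "?", normalized)   # numeric literals
    normalized = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", normalized)  # IN lists
    normalized = re.sub(r"\s+", " ", normalized)                 # whitespace runs
    return hashlib.sha1(normalized.encode()).hexdigest()[:16]


ru_by_fingerprint = defaultdict(float)


def record(sql: str, ru: float) -> None:
    """Attribute the RUs a query burned to its fingerprint."""
    ru_by_fingerprint[fingerprint(sql)] += ru


def top_offenders(limit: int = 3):
    """Fingerprints ordered by total RUs consumed, most expensive first."""
    ranked = sorted(ru_by_fingerprint.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


record("SELECT * FROM users WHERE id = 42", 1.0)
record("select *  from users where id = 7", 1.0)
record("SELECT * FROM orders WHERE note LIKE '%refund%'", 10_000.0)
print(top_offenders())
```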
With all that said, when does it make sense to adopt a tool that adds complexity and latency? My heuristic is that it pays off when either of the following is true:
- Multiple workloads inside the same cluster cause noisy neighbor problems. This is most valuable on massive multi-tenant clusters, but it remains valuable when a single tenant has disparate internal workloads, such as critical and non-critical traffic.
- You need reliability at a >= 99.9% SLA (availability or latency).
Conclusion: The multi-proxy hero we need
The core logic of this proxy is fundamentally independent of the database technology:
- Resource usage tracking and rate limiting
- Advanced load balancing
- Traffic shaping and circuit breaking
By designing an abstract interface, we can extend these protections to legacy SQL and NoSQL stacks, providing a high-availability proxy with dynamic configuration reloads. This approach leans on best-in-class industry experience at scale and enables us to fill critical capability gaps in modern infrastructure while continuing to rely on our trusted datastores.
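A rough sketch of what such an abstract interface could look like follows. It is not the mproxy design, which Part 2 will cover; `BackendAdapter`, `ProxyCore`, `Throttled`, and the toy `EchoAdapter` are all hypothetical names invented for this illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    payload: bytes


class Throttled(Exception):
    """Raised when a tenant has no RU budget left for a request."""


class BackendAdapter(ABC):
    """The only datastore-specific knowledge the proxy core depends on."""

    @abstractmethod
    def estimate_ru(self, request: Request) -> float: ...

    @abstractmethod
    def forward(self, request: Request) -> bytes: ...


class ProxyCore:
    """Datastore-agnostic admission control layered on top of any adapter."""

    def __init__(self, adapter: BackendAdapter, ru_budget: dict[str, float]):
        self.adapter = adapter
        self.ru_budget = dict(ru_budget)

    def handle(self, request: Request) -> bytes:
        cost = self.adapter.estimate_ru(request)
        if cost > self.ru_budget.get(request.tenant, 0.0):
            raise Throttled(request.tenant)
        self.ru_budget[request.tenant] -= cost
        return self.adapter.forward(request)


class EchoAdapter(BackendAdapter):
    """Stand-in backend used only to exercise the core in this example."""

    def estimate_ru(self, request: Request) -> float:
        return float(len(request.payload))

    def forward(self, request: Request) -> bytes:
        return request.payload
```

Supporting another datastore then means writing one more adapter, while the throttling logic in `ProxyCore` stays untouched.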
I’ve christened it: mproxy, a multi-proxy for the datastore layer! Hat tip on the name to prior art in the fintech sector 😂. We build on the shoulders of giants!
I’ve built a prototype that implements many of these capabilities for MongoDB. Now it’s time to discard that design and rebuild a generic implementation with robust simulation testing, clean backend adapter interfaces, and high performance.
Stay tuned!
Part 2 will dive into practical details of the implementation.