2024 State of the Union

It’s 2024 and I’ve made some big changes.

After 7 years in engineering leadership, I’ve requested to shift to being a very senior engineer in my organization. During my term in engineering leadership, I:

Was the CTO of a 50-person startup (successfully acquired)
Was the senior backend director of a major media company
Most importantly, grew a ton by coming to Plaid as a line manager in Platform, then a Lead of Leads in Platform, and then a Tech Lead Manager of the Storage Team.

Why am I ‘holding the career ladder upside down’? It’s to extract the most mutual value… by providing value for my org while continuing to grow in a manner and direction of my choosing.

I’ve gotten to the point where leading teams is no longer interesting or challenging, while building technology still is, with plenty of room to grow.

I shifted from being the TLM of Storage to a senior staff member of Storage, and I will find my next focus internally once we complete our ambitious projects of 2024.

In the meantime, I’m spending my time automating our processes and helping the team move faster. The biggest leverage has come from designing a system of dynamic runbooks that execute the 200+ steps needed to move a service to a new online storage platform with ~90s of write traffic interruption.

In smaller news, I’m having fun coding and building again!

In support of my team:

I forked the mysql terraform provider and added support for Resource Groups in TiDB, which we’ll upstream once it has baked in.
I figured out the type exchange of FFI in deno so that I could create a library for exec’ing processes (exec).
I re-built a favorite tool of mine as tome-cli after using tome for a few years in production.
I maintain my own package management system based off of cash-app’s hermit which lives at packages.
I’m contributing back to upstream projects to patch bugs or improve usability (tiup & dagu)
I’ll be presenting about TiDB at HTAP Summit in Sept 2024.

Improved e2e testing by replacing bats-core with deno+dax

micro

When writing tome-cli, I wanted to ensure that the tool’s side effects were accurately tested at their outermost bound and so I needed to wrap the full tool’s binary execution in end to end tests.

Normally I would use bats-core for bash/cli testing. But I’ve been using deno more and more for work and find the dax library (like zx from Google) to be a simple and powerful mechanism for shelling out from a more robust language.

I simplified my testing interface by wrapping the tome-cli command in a deno function, injecting the necessary environment variables, and pre-parsing the outputs so that each test has little repetition.

Function setup: https://github.com/zph/tome-cli/blob/main/test/tome-cli.test.ts#L27-L31

Example test: https://github.com/zph/tome-cli/blob/main/test/tome-cli.test.ts#L100-L108

This approach makes for easier E2E testing compared to bats because I have robust data types without the fragility/peculiarities of bash, while having a clearer UI for assertions and their diffs.
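The wrapper idea described in this post can be sketched roughly as follows. This is not the code from the tome-cli repository: the names RunResult, Runner, makeCli and the EXAMPLE_ROOT variable are hypothetical, and the runner is left abstract so the sketch stays independent of how the binary is actually executed (in Deno it could be built on dax or Deno.Command).

```typescript
// Hypothetical sketch: RunResult, Runner, makeCli and EXAMPLE_ROOT are
// illustrative names, not copied from the tome-cli repository.
interface RunResult { code: number; stdout: string; stderr: string }
type Runner = (cmd: string[], env: Record<string, string>) => RunResult;

interface ParsedResult { code: number; lines: string[]; stderr: string }

// Build a test helper bound to one binary, a fixed environment and a runner.
function makeCli(binary: string, env: Record<string, string>, run: Runner) {
  return (...args: string[]): ParsedResult => {
    const { code, stdout, stderr } = run([binary, ...args], { ...env });
    // Pre-parse stdout once so tests assert on clean lines, not raw text.
    const lines = stdout
      .split("\n")
      .map((line) => line.replace(/\s+$/, ""))
      .filter((line) => line.length > 0);
    return { code, lines, stderr };
  };
}

// Stand-in runner for illustration only; a real one would execute the binary.
const echoRunner: Runner = (cmd, env) => ({
  code: 0,
  stdout: `${cmd.slice(1).join(" ")}\nroot=${env["EXAMPLE_ROOT"]}\n`,
  stderr: "",
});

const exampleCli = makeCli("tome-cli", { EXAMPLE_ROOT: "/tmp/examples" }, echoRunner);
// Two lines come back: the echoed arguments and the injected variable.
console.log(exampleCli("help").lines);
```

Each test then reduces to one call plus assertions on typed fields, which is where the low repetition comes from.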

NLB Target Handling During pre-change and post-change Hooks

micro

I’ve been using a tool lately that provides good defaults for performing complex database operations, but found a few cases where we’d need to contribute upstream, fork the tool, or determine a generic way to extend it for our purposes.

There are a few ways to go here:

Golang plugins (or hashicorp/go-plugin)
Pre/post hooks for arbitrary shell scripts in the cli tool
Extend the cli.

My choice has been to extend the cli in cases of shared utility for other users, use pre/post hooks in simple cases, and use Golang plugins for complex interactions or complex data.
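The pre/post hook option can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the tool's actual mechanism: ChangeContext, Hook and runWithHooks are hypothetical names, the hooks are plain functions where the real ones would be shell scripts, and running post hooks even when the change fails is a design choice of the sketch.

```typescript
// Hypothetical sketch: ChangeContext, Hook and runWithHooks are illustrative.
interface ChangeContext { target: string; log: string[] }
type Hook = (ctx: ChangeContext) => void;

function runWithHooks<T>(
  ctx: ChangeContext,
  pre: Hook[],
  change: (ctx: ChangeContext) => T,
  post: Hook[],
): T {
  // A failing pre-change hook aborts before the change is attempted.
  for (const hook of pre) hook(ctx);
  try {
    return change(ctx);
  } finally {
    // Post-change hooks run even if the change throws, so cleanup such as
    // re-registering a load balancer target is not skipped.
    for (const hook of post) hook(ctx);
  }
}

const exampleCtx: ChangeContext = { target: "db-node-1", log: [] };
runWithHooks(
  exampleCtx,
  [(c) => { c.log.push(`deregister ${c.target}`); }],
  (c) => { c.log.push("apply change"); },
  [(c) => { c.log.push(`register ${c.target}`); }],
);
console.log(exampleCtx.log.join(" -> "));
```

The appeal of this option is that the tool only needs to know when to call the hooks, while everything environment-specific stays outside of it.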

On Reliability

micro engineering excellence principles

Today I read The Calculus of Service Availability: You’re only as available as the sum of your dependencies, and it summarizes some of the most salient wisdom in designing for reliability targets.

The takeaways are:

Humans have enough imperfections in their nearby systems that 4 or 5 9s of reliability is the maximum value worth targeting
Problems come from the service itself or its critical dependencies.
Availability = MTTF/(MTTF+MTTR)
Rule of an extra 9 - Any service should rely only on critical services that exceed its own SLA by 1x 9 of reliability, ie a 3x 9s service should only depend on 4x 9s services in its critical path.
When depending on services not meeting that threshold, it must be accounted for via resilient design
The math:

Assume a service has an error budget of 0.01%.
Choose a 0.005% error budget for the service itself and 0.005% for its critical dependencies; 5 dependencies then each get 1/5 of 0.005%, or 0.001%, ie each must be 99.999% available.
If calculating aggregate availability: 1x service and 2x dependency services at 99.99% each yield 99.99% * 99.99% * 99.99% ≈ 99.97% availability. This can be raised by adding redundancy or removing hard dependencies.
Frequency * Detection * Recovery = the impact of outages and the feasibility of a given SLA.

Consequently, the levers for improvement are: reduce frequency of outage, blast radius (sharding, cellular architectures, or customer isolation), and MTTR.
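The arithmetic in the takeaways above can be checked with a few small functions. The function names are my own, and the redundancy formula assumes replicas fail independently, which real systems only approximate.

```typescript
// Availability from mean time to failure and mean time to repair (same units).
function availability(mttf: number, mttr: number): number {
  return mttf / (mttf + mttr);
}

// Hard dependencies in series: every part must be up, so availabilities multiply.
function serialAvailability(parts: number[]): number {
  return parts.reduce((acc, a) => acc * a, 1);
}

// Give `dependencyShare` of the total error budget to the critical
// dependencies and split it equally among them.
function requiredDependencyAvailability(
  totalErrorBudget: number,
  dependencyShare: number,
  dependencyCount: number,
): number {
  const perDependencyBudget = (totalErrorBudget * dependencyShare) / dependencyCount;
  return 1 - perDependencyBudget;
}

// Redundancy: the group is down only if every replica is down at once.
// Assumes independent failures.
function redundantAvailability(single: number, replicas: number): number {
  return 1 - Math.pow(1 - single, replicas);
}

const pct = (a: number): string => (a * 100).toFixed(3) + "%";

// 0.01% budget, half of it for 5 critical dependencies.
console.log(pct(requiredDependencyAvailability(0.0001, 0.5, 5))); // 99.999%
// One service and two dependencies at 99.99% each.
console.log(pct(serialAvailability([0.9999, 0.9999, 0.9999]))); // 99.970%
```

The last helper shows why redundancy is one of the levers: under the independence assumption, two 99% replicas behave like a single 99.99% dependency.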

I’ve been evolving our Online Database Platform at work and the themes of “rule of an extra 9” and how to move quickly as well as safely with limited blast radius are top of mind. The tradeoff here is complexity and a rule of thumb that each additional 9 costs 10x the effort/cost.

We’ve made some major changes (cluster topology, upgrades in nosql and sql, automation tooling) that are moving our stack to a point where I’m proud of the accomplishments.

Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).

Deno for Shell Scripting a Pagerduty helper

micro

Deno, a modern JS platform, worked well tonight as a scripting language and ecosystem for building a tiny cli interface for paging in teammates.

I will expand my use of it and replace usage of zx + js with vl + ts + Deno.

I prototyped the cli with a focus on reducing human cognitive overhead during stressful operations by providing sound defaults and fewer choices when deciding how many teammates to bring in, what to use as the incident message, and which details to include.

A few things made this easy:

Confidence in deploying this as a standalone tool thanks to Deno without dependency management issues
npx allowing for lazy loading of the pagerduty-cli tool to shell out to
vl as a deno library that emulates behavior from Google’s SRE tooling of zx. These are javascript libraries that make for convenient bash-like scripting in a js/ts environment.

Prototype script below:

Replacing inlined scripts with bundler inline

micro

git smart-pull is a great tool for avoiding the messiness of git rebases when there’s changed content.

I long ago inlined the full ruby gem into a single executable file to avoid the hassle of installing it in various ruby environments. It’s worked well!

Ruby 3.2.0 broke my script in a tiny way and broke the underlying gem. The patch has sat unmerged for months and now with bundler/inline I have a better solution than keeping a spare inlined script… forking the project and pointing a wrapper script with bundler/inline at my own repo.

I’m applying patches from PRs in upstream (eg hub merge https://github.com/geelen/git-smart/pull/25).

It’s an elegant solution that I’ll re-use for other ruby scripts in my development environment.

Automatically Warm EBS Volumes from Snapshots

micro ebs

We’re automating more of our cluster operations at work, and here’s the procedure for warming an EBS volume created from a snapshot to avoid its first-read/write performance issues.

How it works

Follow the instructions in the gist to install it in root’s crontab as an @reboot entry. It uses an exclusive flock and a completion file to ensure idempotency.

3x Faster Mongodb Controlled Failovers

micro mongodb ops

I recently modified our failover protocol at work for MongoDB in a way that reduces the interruption from 14 seconds down to 3.5 seconds by altering election configurations ahead of controlled failovers. This was tested on a 3.4 cluster but should hold true up until modern versions. Starting in 4.0.2 it’s less valuable for controlled failovers but still useful as a tuning setting for uncontrolled failovers.

How it works

The premise is to make the shard call a new election as fast as possible by reducing electionTimeoutMillis and heartbeatIntervalMillis.

Procedure:

// on the primary
cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 1000
cfg.settings["heartbeatIntervalMillis"] = 100
rs.reconfig(cfg)

// wait 60 seconds for propagation
rs.stepDown()

// wait for 60 seconds for election to settle
// connect to primary

cfg = rs.conf()
cfg.settings["electionTimeoutMillis"] = 10000
cfg.settings["heartbeatIntervalMillis"] = 1000
rs.reconfig(cfg)

This is also valuable to tune if you’re on high-quality, low-latency networks. Otherwise you miss out on faster failovers in uncontrolled circumstances every time mongo insists on waiting 10 seconds before allowing an election, even when heartbeats are failing.
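One possible refinement of the procedure above is to capture the election settings before lowering them, so the restore step puts back whatever the replica set had instead of hard-coded values. The helpers below are a hypothetical sketch (readElectionSettings and applyElectionSettings are my names, not MongoDB APIs). They only build config documents; the results would still be passed to rs.reconfig() by hand, and porting them into the mongo shell means dropping the type annotations.

```typescript
// Hypothetical helpers: these names are illustrative, not MongoDB APIs.
interface ElectionSettings {
  electionTimeoutMillis: number;
  heartbeatIntervalMillis: number;
}

interface ReplSetConfig {
  settings?: { [key: string]: unknown };
  [key: string]: unknown;
}

// Read the current values so they can be restored after the failover.
function readElectionSettings(cfg: ReplSetConfig): ElectionSettings {
  const settings = cfg.settings ?? {};
  const timeout = settings["electionTimeoutMillis"];
  const heartbeat = settings["heartbeatIntervalMillis"];
  if (typeof timeout !== "number" || typeof heartbeat !== "number") {
    throw new Error("election settings missing from replica set config");
  }
  return { electionTimeoutMillis: timeout, heartbeatIntervalMillis: heartbeat };
}

// Return a copy of the config with the given election settings applied.
// The input is not mutated and unrelated settings are preserved.
function applyElectionSettings(cfg: ReplSetConfig, s: ElectionSettings): ReplSetConfig {
  return { ...cfg, settings: { ...(cfg.settings ?? {}), ...s } };
}

// Values taken from the procedure above.
const FAST_ELECTION: ElectionSettings = {
  electionTimeoutMillis: 1000,
  heartbeatIntervalMillis: 100,
};

// Intended flow, in shell terms:
//   saved = readElectionSettings(rs.conf())
//   rs.reconfig(applyElectionSettings(rs.conf(), FAST_ELECTION)); wait; rs.stepDown()
//   on the new primary: rs.reconfig(applyElectionSettings(rs.conf(), saved))
```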

PS - While browsing the docs I found the statement below ^_^, which is non-intuitive: I would expect writes to stop on one shard with no impact on the other shards. Presumably it’s a typo and cluster means replicaset.

During the election process, the cluster cannot accept write operations until it elects the new primary.

Use GEM_HOME for bundler inline

micro

Ruby’s bundler inline installs to the --system destination and does not respect BUNDLE_PATH (which I would have expected it to).

The symptoms are permission errors, with the server requesting root access for the bundler install.

Digging around in github issues, this is desired behavior: https://github.com/rubygems/bundler/pull/7154

Solution:

export GEM_HOME=vendor/bundle

Hugo to 0.105.0

micro

I’ve upgraded Hugo, ditched webpack for esbuild and removed most javascript from blog 🍋. The best part is that it now does syntax highlighting on the server side and builds in < 500ms 🐎.

While reading docs, I learned that Hugo allows for custom type blocks, such as the mermaid diagram in the code block below that’s rendered into an image. Unsure if this will have practical use in the blog, but it’s a solution in search of a problem ;).

stateDiagram-v2
    [*] --> Still
    Still --> [*]
    Still --> Moving
    Moving --> Still
    Moving --> Crash
    Crash --> [*]

xargs.io

All the IO and Multiplexing

06 Aug 2024

06 Aug 2024

22 Jul 2023

10 May 2023

12 Apr 2023

07 Apr 2023

06 Apr 2023

06 Apr 2023

28 Jan 2023

05 Nov 2022