12 Sep 2016

Good Dull Best Practices in Operations

Tonight I read this paper: pdf or archive.

And I was impressed by the boring and solid guidelines therein.

My takeaways were:

  • Immutable/recreatable servers and infrastructure. Poignant b/c of an EC2 hardware failure today and needing to recover by snapshotting the root drive and reattaching to new server.
  • Instrument all the things.
  • Spread testing across unit, integration and multi-service tests
  • Gradual deployment of new services/updates. As a colleague and friend would say, “bake it in production for a little while”.
  • “Proven technology is almost always better than operating on the bleeding edge. Stable software is better than an early copy, no matter how valuable the new feature seems.” Bleeding edge is named that way for a reason. I find this tension between proven technology and bleeding edge to be on my mind lately.