Setting a Service Level Objective for your service is only the start of your quantified reliability journey. What do you do when you've had a few too many incidents and blown your error budget? Or had a pile of near-misses that burned the team out even though the user-facing SLO wasn't violated? What if the incident was triggered by the very infrastructure refactoring meant to improve, not harm, reliability and maintainability? In this talk, you'll learn how the team at Honeycomb handles incidents, chaos engineering, and the engineering feedback loop for reliability through social practices and architectural design.

Managing to Your SLO in the Face of Chaos

Liz Fong-Jones