treat alerts like comments

shelby spees · Jan 16, 2020 · 3 min read

My [Twitter thread](twitter.com/shelbyspees/status/121786746734..) was featured in SRE Weekly!

It started with Jacob quote-tweeting my soon-to-be boss, Charity, whose thread is also worth reading.

After following the discussion a bit, I chimed in with my thoughts. (Original thread edited for readability.)

This might be oversimplifying, but I feel like system resiliency can follow a similar pattern to the traditional advice for commenting your code: "Write your code so it doesn't need comments, and then comment it anyway." (Not trying to start a flame war here, bear with me.)

@mipsytipsy's thought experiment was essentially that: Instrument and observe your code as if you don't have pager alerts. Then add pager alerts back in.

Alerts, like comments, are static. It takes extra cognitive resources to validate their value and correctness (compared to code, which can at least be run and tested). Comments and documentation don't get updated when code does. The same is true for many pager alerts.

My team has been getting a weird alert all week: numInputRows is too low. Systems were behaving fine, data was streaming fine. Each day when it triggered, we'd query the DB to make sure data was arriving, and it was.

Turns out when the alert was made, it was assumed we'd never have zero numInputRows. Inventory is low right now, so we have zero. I started asking naive questions: what's the failure this alert is supposed to catch? (Bonus: I learned things.)

Validating that asynchronous data streams are arriving at their destination is nontrivial (everyone besides me probably already knows lol). So we changed the code so the alert would actually alert on a streaming failure and not zero inventory.
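
To put the fix in concrete terms, here's a rough sketch of the distinction (hypothetical data shape and thresholds, not our actual alerting code): the old check paged whenever numInputRows was too low, the new one pages only when the stream stops delivering batches at all.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shape of the stream's progress metrics: one record per
# micro-batch, with a timestamp and that batch's numInputRows count.
recent_batches = [
    {"timestamp": datetime(2020, 1, 16, 8, 0, tzinfo=timezone.utc), "numInputRows": 0},
    {"timestamp": datetime(2020, 1, 16, 8, 5, tzinfo=timezone.utc), "numInputRows": 0},
]

def should_page(batches, now, max_silence=timedelta(minutes=15)):
    """Page on a streaming failure, not on legitimately empty input.

    Old (broken) logic: page whenever numInputRows is too low, which
    fires constantly when inventory is genuinely zero.
    New logic: page only when no batches have arrived at all within
    max_silence, i.e. the pipeline itself has stopped delivering.
    """
    if not batches:
        return True  # nothing has ever arrived -- definitely investigate
    latest = max(b["timestamp"] for b in batches)
    return now - latest > max_silence

now = datetime(2020, 1, 16, 8, 10, tzinfo=timezone.utc)
print(should_page(recent_batches, now))  # False: batches arrive, rows are just zero
```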

Which feels to me like changing code to make a comment become true. Sometimes the code is wrong and the comment is right, sure. In our case, detecting that failure is valuable enough that it was worth changing the code to make the alert true.

But we shouldn't be afraid to delete noisy, unhelpful comments and we shouldn't be afraid to axe unhelpful, noisy alerts.

And finally, it shouldn't have taken us three days to verify that the data stream was healthy and the alert was signaling the wrong thing. We can't control internal Kinesis health, but we should at least be able to observe the stream end to end.
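
The manual check we kept doing by hand each day (query the DB, confirm data is arriving) is the kind of thing that's easy to automate as an end-to-end freshness check. A minimal sketch, assuming a hypothetical Postgres destination with an inventory_events table and an ingested_at column:

```python
import psycopg2  # assuming the destination is Postgres

def minutes_since_last_ingest(conn):
    """How stale is the destination table, end to end?"""
    with conn.cursor() as cur:
        # MAX() over an empty table returns NULL, so this yields None
        # if nothing has ever landed.
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - MAX(ingested_at))) / 60 "
            "FROM inventory_events"
        )
        (minutes,) = cur.fetchone()
        return minutes

conn = psycopg2.connect("dbname=warehouse")  # hypothetical destination DB
staleness = minutes_since_last_ingest(conn)

# Alert on staleness of the destination, not on the size of any one batch.
if staleness is None or staleness > 30:
    print("page: no new data has landed in 30+ minutes")
```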

Comments should be true and helpful. Well-written code still needs comments, but you don't comment every line with what it's doing. Alerts should be true and actionable. Well-instrumented systems still need pager alerts, but you don't alert on implementation details.

That was long and I know I blew up everyone's mentions, but I'm feeling the alert fatigue and I see the poor signal/noise ratio. Pager alerts are important, but they shouldn't be the only way we know our systems are healthy. They're too lossy. @mipsytipsy is asking us to use the right tool for the job. Instrument and observe.
