Treat alerts like comments
My thread here was featured in SRE Weekly! It started with Jacob quote-tweeting my soon-to-be boss, Charity, whose thread is also worth reading:
tl;dr: page when users are unhappy. Be careful about paging on behavior that is anomalous but may or may not correspond to user pain. E.g... SLOs and error budgets.
— Jacob (@jhscott) January 16, 2020
This won't save you from everything (e.g., https://t.co/tZ6zVWjtUf) but it is good "modern" table stakes. https://t.co/SJb09Y0y88
After following the discussion a bit, I chimed in with my thoughts:
This might be oversimplifying, but I feel like system resiliency can follow a similar pattern to the traditional advice for commenting your code:
— shelby reliability engineer (@shelbyspees) January 16, 2020
"Write your code so it doesn't need comments, and then comment it anyway."
(Not trying to start a flame war here, bear with me.)
@mipsytipsy's thought experiment was essentially that:
— shelby reliability engineer (@shelbyspees) January 16, 2020
Instrument and observe your code as if you don't have pager alerts. Then add pager alerts back in.
Alerts, like comments, are static. It takes extra cognitive resources to validate their value and correctness (compared to code, which can at least be run and tested).
— shelby reliability engineer (@shelbyspees) January 16, 2020
Comments and documentation don't get updated when code does. The same is true for many pager alerts.
My team has been getting a weird alert all week: numInputRows is too low. Systems were behaving fine, data was streaming fine.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Each day when it was triggered we'd query the DB to make sure data was arriving, and it was.
Turns out when the alert was made, it was assumed we'd never have zero numInputRows. Inventory is low, now we have zero.
— shelby reliability engineer (@shelbyspees) January 16, 2020
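For a sense of the shape of that original check: it was essentially a fixed volume threshold. Here's a minimal sketch with invented names and a made-up threshold, not our actual alerting config:

```python
# Hypothetical sketch of the original check (names and threshold invented).
# It pages whenever the batch row count dips below a fixed floor, which
# silently assumes "zero rows" can only mean "the stream is broken."

MIN_INPUT_ROWS = 100  # floor picked back when inventory was always plentiful

def check_input_volume(num_input_rows: int) -> bool:
    """Return True if this metric should page someone."""
    # Fires on low volume, but can't tell "pipeline is down" apart from
    # "there's legitimately nothing to stream right now."
    return num_input_rows < MIN_INPUT_ROWS

# With inventory at zero, num_input_rows == 0 and this pages every day,
# even though the stream itself is perfectly healthy.
```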
I started asking naive questions. What's the failure that this alert is supposed to catch? (Bonus, I learned things.)
Validating that asynchronous data streams are arriving at their destination is nontrivial (everyone besides me probably already knows lol).
— shelby reliability engineer (@shelbyspees) January 16, 2020
So we changed the code so the alert would actually alert on a streaming failure and not zero inventory.
Which feels to me like changing code to make a comment become true.
— shelby reliability engineer (@shelbyspees) January 16, 2020
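The fix pointed the alert at stream health rather than volume. Roughly (again with hypothetical names and thresholds), the check moved from "how many rows?" to "is anything arriving when something should be?":

```python
# A sketch of the direction we moved in: alert on whether data is actually
# flowing end to end, not on how much of it there is. Names and thresholds
# here are illustrative, not our real code.

import time

MAX_SILENCE_SECONDS = 15 * 60  # how long the destination may go without a batch

def check_stream_liveness(last_batch_received_at: float, producer_has_data: bool) -> bool:
    """Return True if this should page someone."""
    silence = time.time() - last_batch_received_at
    # Only page when the producer claims it has records to send but nothing
    # has arrived at the destination for too long. Zero inventory no longer
    # looks like a streaming failure.
    return producer_has_data and silence > MAX_SILENCE_SECONDS
```

With that shape, zero inventory produces zero rows and zero pages, while a stalled stream still gets caught.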
Sometimes the code is wrong and the comment is right, sure. In our case, the alert is valuable enough for detecting that failure that it was worth changing the code to make it true.
But we shouldn't be afraid to delete noisy, unhelpful comments and we shouldn't be afraid to axe unhelpful, noisy alerts.
— shelby reliability engineer (@shelbyspees) January 16, 2020
And finally, it shouldn't have taken us three days to verify that the data stream was healthy and the alert was signaling the wrong thing.
— shelby reliability engineer (@shelbyspees) January 16, 2020
We can't control internal Kinesis health, but we should at least be able to observe the stream end to end.
Comments should be true and helpful. Well-written code still needs comments, but you don't comment every line with what it's doing.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Alerts should be true and actionable. Well-instrumented systems still need pager alerts, but you don't alert on implementation details.
That was long and I know I blew up everyone's mentions, but I'm feeling the alert fatigue, I see the poor signal/noise ratio.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Pager alerts are important, but they shouldn't be the only way we know our systems are healthy. They're too lossy. @mipsytipsy is asking us to use the right tool for the job. Instrument and observe.
— shelby reliability engineer (@shelbyspees) January 16, 2020
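Thinking about Charity's and Jacob's point a bit more concretely: "page when users are unhappy" looks less like a row-count threshold and more like a check against an SLO target. This is a deliberately simplified sketch, with an invented target and function name, not a recipe for real error-budget alerting:

```python
# One way to express "page when users are unhappy": compare the observed
# success rate for real requests against an SLO target instead of watching
# internal implementation details. Numbers are illustrative.

SLO_TARGET = 0.999  # 99.9% of requests should succeed

def should_page(successful_requests: int, total_requests: int) -> bool:
    """Page only when user-visible success drops below the SLO target."""
    if total_requests == 0:
        return False  # no traffic, no user pain
    return successful_requests / total_requests < SLO_TARGET
```

Real error-budget alerting is more involved (burn rates, multiple windows), but the shape of the question is the same: is anyone actually hurting?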
2020-01-16 16:53 (Last updated: 2020-09-20 08:36)