Treat alerts like comments
My thread here was featured in SRE Weekly! It started with Jacob quote-tweeting my soon-to-be boss, Charity, whose thread is also worth reading:
tl;dr: page when users are unhappy. Be careful about paging on behavior that is anomalous but may or may not correspond to user pain. E.g... SLOs and error budgets.
— Jacob (@jhscott) January 16, 2020
This won't save you from everything (e.g., https://t.co/tZ6zVWjtUf) but it is good "modern" table stakes. https://t.co/SJb09Y0y88
After following the discussion a bit, I chimed in with my thoughts:
This might be oversimplifying, but I feel like system resiliency can follow a similar pattern to the traditional advice for commenting your code:
— shelby reliability engineer (@shelbyspees) January 16, 2020
"Write your code so it doesn't need comments, and then comment it anyway."
(Not trying to start a flame war here, bear with me.)
@mipsytipsy's thought experiment was essentially that:
— shelby reliability engineer (@shelbyspees) January 16, 2020
Instrument and observe your code as if you don't have pager alerts. Then add pager alerts back in.
Alerts, like comments, are static. It takes extra cognitive resources to validate their value and correctness (compared to code, which can at least be run and tested).
— shelby reliability engineer (@shelbyspees) January 16, 2020
Comments and documentation don't get updated when code does. The same is true for many pager alerts.
My team has been getting a weird alert all week: numInputRows is too low. Systems were behaving fine, data was streaming fine.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Each day when it was triggered we'd query the DB to make sure data was arriving, and it was.
Turns out when the alert was made, it was assumed we'd never have zero numInputRows. Inventory is low, now we have zero.
— shelby reliability engineer (@shelbyspees) January 16, 2020
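For a sense of the shape of that original check: it was essentially a fixed volume threshold. Here's a minimal sketch with invented names and a made-up threshold, not our actual alerting config:

```python
# Hypothetical sketch of the original check (names and threshold invented).
# It pages whenever the batch row count dips below a fixed floor, which
# silently assumes "zero rows" can only mean "the stream is broken."

MIN_INPUT_ROWS = 100  # floor picked back when inventory was always plentiful

def check_input_volume(num_input_rows: int) -> bool:
    """Return True if this metric should page someone."""
    # Fires on low volume, but can't tell "pipeline is down" apart from
    # "there's legitimately nothing to stream right now."
    return num_input_rows < MIN_INPUT_ROWS

# With inventory at zero, num_input_rows == 0 and this pages every day,
# even though the stream itself is perfectly healthy.
```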
I started asking naive questions. What's the failure that this alert is supposed to catch? (Bonus, I learned things.)
Validating that asynchronous data streams are arriving at their destination is nontrivial (everyone besides me probably already knows lol).
— shelby reliability engineer (@shelbyspees) January 16, 2020
So we changed the code so the alert would actually alert on a streaming failure and not zero inventory.
Which feels to me like changing code to make a comment become true.
— shelby reliability engineer (@shelbyspees) January 16, 2020
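The fix pointed the alert at stream health rather than volume. Roughly (again with hypothetical names and thresholds), the check moved from "how many rows?" to "is anything arriving when something should be?":

```python
# A sketch of the direction we moved in: alert on whether data is actually
# flowing end to end, not on how much of it there is. Names and thresholds
# here are illustrative, not our real code.

import time

MAX_SILENCE_SECONDS = 15 * 60  # how long the destination may go without a batch

def check_stream_liveness(last_batch_received_at: float, producer_has_data: bool) -> bool:
    """Return True if this should page someone."""
    silence = time.time() - last_batch_received_at
    # Only page when the producer claims it has records to send but nothing
    # has arrived at the destination for too long. Zero inventory no longer
    # looks like a streaming failure.
    return producer_has_data and silence > MAX_SILENCE_SECONDS
```

With that shape, zero inventory produces zero rows and zero pages, while a stalled stream still gets caught.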
Sometimes the code is wrong and the comment is right, sure. In our case, the alert is valuable enough for detecting that failure that it was worth changing the code to make it true.
But we shouldn't be afraid to delete noisy, unhelpful comments and we shouldn't be afraid to axe unhelpful, noisy alerts.
— shelby reliability engineer (@shelbyspees) January 16, 2020
And finally, it shouldn't have taken us three days to verify that the data stream was healthy and the alert was signaling the wrong thing.
— shelby reliability engineer (@shelbyspees) January 16, 2020
We can't control internal Kinesis health, but we should at least be able to observe the stream end to end.
Comments should be true and helpful. Well-written code still needs comments, but you don't comment every line with what it's doing.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Alerts should be true and actionable. Well-instrumented systems still need pager alerts, but you don't alert on implementation details.
That was long and I know I blew up everyone's mentions, but I'm feeling the alert fatigue, I see the poor signal/noise ratio.
— shelby reliability engineer (@shelbyspees) January 16, 2020
Pager alerts are important, but they shouldn't be the only way we know our systems are healthy. They're too lossy. @mipsytipsy is asking us to use the right tool for the job. Instrument and observe.
— shelby reliability engineer (@shelbyspees) January 16, 2020
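Thinking about Charity's and Jacob's point a bit more concretely: "page when users are unhappy" looks less like a row-count threshold and more like a check against an SLO target. This is a deliberately simplified sketch, with an invented target and function name, not a recipe for real error-budget alerting:

```python
# One way to express "page when users are unhappy": compare the observed
# success rate for real requests against an SLO target instead of watching
# internal implementation details. Numbers are illustrative.

SLO_TARGET = 0.999  # 99.9% of requests should succeed

def should_page(successful_requests: int, total_requests: int) -> bool:
    """Page only when user-visible success drops below the SLO target."""
    if total_requests == 0:
        return False  # no traffic, no user pain
    return successful_requests / total_requests < SLO_TARGET
```

Real error-budget alerting is more involved (burn rates, multiple windows), but the shape of the question is the same: is anyone actually hurting?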
2020-01-16 16:53 (Last updated: 2020-09-20 08:36)