6 minutes
Alex Hidalgo: I Have an SLO. Now What?
Watching this now! https://t.co/GKJD9FI8Qo
— shelby reliability engineer (@shelbyspees) August 13, 2020
With logs, metrics, and traces on distributed systems, it's easy to end up with information overload.
- @ahidalgosre
Reliability stack starts with service level indicators (SLIs)
- measurement used to define how a service is operating
- often calculated via a combination of many different things
- @ahidalgosre
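To make the SLI idea concrete, here's a minimal sketch of a combined availability-and-latency SLI: the fraction of requests that were both successful and fast enough. The event fields and the 500 ms threshold are my own illustrative assumptions, not from the talk.

```python
# Minimal SLI sketch: fraction of "good" events (successful AND fast enough)
# out of all events in a window. Fields and threshold are illustrative.
def sli(events, latency_threshold_ms=500):
    if not events:
        return 1.0  # no traffic: nothing missed the target
    good = sum(
        1 for e in events
        if e["status"] < 500 and e["latency_ms"] <= latency_threshold_ms
    )
    return good / len(events)

events = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 830},  # too slow: not a good event
    {"status": 503, "latency_ms": 45},   # server error: not a good event
    {"status": 200, "latency_ms": 210},
]
print(f"SLI: {sli(events):.2%}")  # SLI: 50.00%
```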
Service level objective (SLO)
- target percentage informed by SLI
- often with a threshold involved
- nothing is ever 100% reliable, and it's too expensive to try to get there
- use a reasonable percentage
99.999% reliability is ~5 mins of downtime per year
- @ahidalgosre
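Quick sanity check on that figure (my arithmetic, not a slide from the talk): converting a target percentage into allowed downtime over a window.

```python
# How an availability target turns into allowed "bad" minutes over a window.
def allowed_downtime_minutes(target, window_days):
    return (1 - target) * window_days * 24 * 60

print(allowed_downtime_minutes(0.99999, 365))  # ~5.26 min/year for five nines
print(allowed_downtime_minutes(0.999, 365))    # ~525.6 min/year (~8.8 hours) for 99.9%
```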
Error budget
- calculate how well your SLO has performed over a time window
- helps you have discussions about how reliable you've been, use narrative form:
99.9% SLO == 0.1% budget == "43.2 acceptable bad minutes every 30 days"
- @ahidalgosre
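Here's the same 99.9%/30-day example as budget accounting, in a rough sketch; the 12 bad minutes are a made-up number for illustration.

```python
# Error budget accounting for a 99.9% SLO over a 30-day window.
window_minutes = 30 * 24 * 60                   # 43,200 minutes
budget_minutes = (1 - 0.999) * window_minutes   # 43.2 "acceptable bad minutes"

bad_minutes = 12.0  # hypothetical unreliable minutes observed so far
print(f"budget:    {budget_minutes:.1f} min")
print(f"consumed:  {bad_minutes / budget_minutes:.1%}")      # ~27.8%
print(f"remaining: {budget_minutes - bad_minutes:.1f} min")  # ~31.2 min
```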
Service level agreements (SLAs) are contractual agreements, involving lawyers and finance. Not what we're talking about here.
The reliability stack measures *service* reliability
Your service has one job: do what your users need it to do well enough and enough of the time (not 100% of the time).
on Netflix: 20 second buffer one time is no big deal. every time? that's a bad experience
- @ahidalgosre
SLOs lead to happier people
- users: service does what they need
- engineers: reduce toil, alert fatigue
- product teams: PMs might have a "user journey" doc that describes your SLI
- business: might have a KPI that maps to your SLI
Get everyone on the same page!
- @ahidalgosre
Classic example from SRE book:
- error budget surplus? ship features!
- error budget exceeded? stop! fix reliability!
Let's stop thinking about things this way.
- @ahidalgosre
There's nothing wrong with the classic model, but you have to be careful. Dev teams usually can't stop feature work.
Causes back pressure: once the freeze lifts and your error budget reopens, everything releases at once and you burn through the budget again.
- @ahidalgosre
Freezing the release pipeline is dangerous. Instead, talk to your team. These are nuanced decisions.
- @ahidalgosre
Project work focus: reliability improvements *are* features.
Not everyone owns the code they run; the classic model assumes a certain relationship among teams. E.g. OSS projects
- @ahidalgosre
Calculating and measuring better SLIs *is* project work. So is picking a better SLO threshold. Examine your measurements often; they could be very wrong.
It's not set-it-and-forget-it. Your measurements won't be perfect out the gate (or ever!).
- @ahidalgosre
Examining risk factors: error budget burn lets you identify where you're not being reliable, even for short windows. This allows you to determine what your greatest risks are.
Error budget burn may or may not be a problem; it's up to you to discuss.
- @ahidalgosre
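One way to spot those short windows is to compute a burn rate per window, where 1.0 means you're consuming budget exactly as fast as the SLO allows. This is my sketch, assuming a 99.9% SLO and invented hourly numbers.

```python
# Error-budget burn rate per short window. A rate of 1.0 consumes budget exactly
# as fast as a 99.9% SLO allows; sustained spikes above that are risk factors
# worth discussing. The hourly bad-minute figures below are invented.
SLO = 0.999

def burn_rate(bad_minutes, window_minutes):
    return (bad_minutes / window_minutes) / (1 - SLO)

hourly_bad_minutes = [0.0, 0.0, 0.5, 0.0, 2.0, 0.0]  # per 60-minute window
for hour, bad in enumerate(hourly_bad_minutes):
    rate = burn_rate(bad, 60)
    flag = "  <-- worth a look" if rate > 1.0 else ""
    print(f"hour {hour}: burn rate {rate:.1f}{flag}")
```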
Experimentation and chaos engineering
- you should try to break your service! not taking it 100% down, but breaking parts of it to learn more about how it responds in reality
- but don't be too cavalier if you haven't been running reliably
- @ahidalgosre
It's not just about planning either. While performing experiments, your SLO data tells you more about the impact on service reliability.
- @ahidalgosre
All changes are experiments! Changing your algorithm, garbage collection, infra. Use your SLO data as a feedback loop.
- @ahidalgosre
Load tests and stress tests require coordination across multiple teams. You *need* SLO data to understand how your service behaves across components. Individual metrics and logs don't answer reliability questions.
- @ahidalgosre
When performing a stress test, the interesting question to answer is: "Where on the curve does the error budget get impacted?" How badly can things go before it impacts service reliability?
- @ahidalgosre
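A sketch of how that question might look in a test harness; `simulated_sli` is a stand-in for whatever your load-testing tooling actually measures, and all the numbers are invented.

```python
# Step up load and report the first level where the measured SLI drops below
# the SLO target, i.e. where the curve starts eating error budget.
SLO_TARGET = 0.999

def simulated_sli(requests_per_second):
    # Stand-in for a real measurement; pretend the service degrades past 800 rps.
    return 1.0 if requests_per_second <= 800 else 0.99

def budget_impact_point(rates):
    for rps in rates:
        if simulated_sli(rps) < SLO_TARGET:
            return rps
    return None

print(budget_impact_point([100, 200, 400, 800, 1600]))  # 1600
```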
Alex discusses the Google story of shutting off the Chubby service once a quarter specifically to burn through their error budget.
Important way to learn about hidden dependencies. (The scream test 😱)
- @ahidalgosre
Do nothing:
- sometimes the numbers or measurements are wrong
- sometimes you already know you have a problem and it'll take a long time to fix
- SLOs are data, not mandates. it's about having conversations
- @ahidalgosre
How do you determine reliability over time?
- incident counting?
- MTT[X] (mean time to [X])
- @ahidalgosre
If your OKR says to reduce the number of incidents:
Q1: 20 incidents
Q2: 10 incidents
...but:
Q1: 120 mins of downtime
Q2: 150 mins of downtime
That's not a better experience
- @ahidalgosre
With MTTX
- remediation/recovery/response/etc.
- failure, detection, engagement...
What do we even care about? There are so many nuances that this metric ends up being meaningless.
- @ahidalgosre
With MTTX
Q1: 20 incidents with 120 mins of unreliability, avg 6m each
Q2: 10 incidents with 150 mins of unreliability, avg 15m each
but Q2 was not 2.5x as bad as Q1: total unreliability only went from 120 to 150 minutes (25% more)
Simple math can lead you down the wrong path.
- @ahidalgosre
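Running those same two quarters through an error-budget lens shows why the budget view reports the experience more honestly than MTTR. The 99.9% SLO and 90-day quarter here are my assumptions for illustration.

```python
# Same two quarters, viewed as MTTR vs. error budget consumed.
# Assumes a 99.9% SLO and a 90-day quarter purely for illustration.
quarter_minutes = 90 * 24 * 60           # 129,600 minutes
budget = (1 - 0.999) * quarter_minutes   # 129.6 budget minutes

for name, incidents, bad_minutes in [("Q1", 20, 120), ("Q2", 10, 150)]:
    mttr = bad_minutes / incidents
    print(f"{name}: MTTR {mttr:.0f}m, budget consumed {bad_minutes / budget:.0%}")
# Q1: MTTR 6m, budget consumed 93%
# Q2: MTTR 15m, budget consumed 116%
```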
Error budgets automatically give you the best way to report reliability over time.
- @ahidalgosre
The most important thing: SLO data is about having better conversations. It gives us indicators of the user's perspective.
Better conversations lead to better decisions.
- @ahidalgosre
The @OReillyMedia book is at the printers, pre-order the physical copy! https://t.co/tPKLrPPIRk
live tweeting, site reliability engineering, service level objectives
1084 Words
2020-08-13 18:07 (Last updated: 2020-09-20 08:36)