
Alex Hidalgo: I Have an SLO. Now What?

Watching this now! https://t.co/GKJD9FI8Qo

— shelby reliability engineer (@shelbyspees) August 13, 2020

With logs, metrics, and traces on distributed systems, it's easy to end up with information overload.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Reliability stack starts with service level indicators (SLIs)

- measurement used to define how a service is operating
- often calculated via a combination of many different things

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
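
Not from the talk, but as a rough sketch of what "calculated via a combination of many different things" can look like: an availability-style SLI as the fraction of good events, where "good" combines a status check and a latency threshold. The event fields and the 500ms cutoff here are made up.

```python
# Sketch of an availability-style SLI: the fraction of "good" events out of
# all events in a window. Field names and the 500ms threshold are
# hypothetical, not from the talk.
def availability_sli(events, latency_threshold_ms=500):
    """An event is good if it succeeded *and* was fast enough."""
    if not events:
        return None  # no traffic, no signal
    good = sum(
        1 for e in events
        if e["status_code"] < 500 and e["duration_ms"] <= latency_threshold_ms
    )
    return good / len(events)

requests = [
    {"status_code": 200, "duration_ms": 120},
    {"status_code": 200, "duration_ms": 750},  # too slow: not good
    {"status_code": 503, "duration_ms": 90},   # server error: not good
    {"status_code": 200, "duration_ms": 300},
]
print(availability_sli(requests))  # 0.5
```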

Service level objective (SLO)
- target percentage informed by SLI
- often with a threshold involved
- nothing is ever 100% reliable, and it's too expensive to try to get there
- use a reasonable percentage

99.999% reliability is ~5 mins of downtime per year

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
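
A quick way to sanity-check that "~5 mins of downtime per year" figure (my own sketch, not the talk's): allowed downtime is just (1 − target) times the window.

```python
# Allowed downtime for a given availability target over a window.
def allowed_downtime_minutes(slo_percent, window_days):
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target, 365)
    print(f"{target}% over a year: {minutes:.1f} minutes of downtime")
# 99.999% works out to about 5.3 minutes per year.
```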

Error budget
- calculate how well your SLO has performed over a time window
- helps you have discussions about how reliable you've been; use narrative form:
99.9% SLO == 0.1% budget == "43.2 acceptable bad minutes every 30 days"

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
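
The 43.2-minute narrative falls out of the same arithmetic. A minimal sketch (the 12 "bad minutes so far" is an invented number, just to show the remaining budget):

```python
# Turn an SLO into an error budget narrative: 99.9% over 30 days
# == 0.1% budget == 43.2 acceptable bad minutes.
def error_budget_minutes(slo_percent, window_days):
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes

budget = error_budget_minutes(99.9, 30)
bad_minutes_so_far = 12.0  # invented figure for illustration
print(f"{budget:.1f} acceptable bad minutes every 30 days, "
      f"{budget - bad_minutes_so_far:.1f} left this window")
# -> 43.2 acceptable bad minutes every 30 days, 31.2 left this window
```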

Service level agreements (SLAs) are contractual agreements, involving lawyers and finance. Not what we're talking about here.

— shelby reliability engineer (@shelbyspees) August 13, 2020

The reliability stack measures *service* reliability

Your service has one job: do what your users need it to do well enough and enough of the time (not 100% of the time).

On Netflix: a 20-second buffer one time is no big deal. Every time? That's a bad experience.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

SLOs lead to happier people
- users: service does what they need
- engineers: reduce toil, alert fatigue
- product teams: PMs might have a "user journey" doc that describes your SLI
- business: might have a KPI that maps to your SLI

Get everyone on the same page!

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Classic example from SRE book:
- error budget surplus? ship features!
- error budget exceeded? stop! fix reliability!

Let's stop thinking about things this way.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
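
The classic model reduces to a one-line decision rule, something like this sketch (which is exactly the kind of automatic gate the talk pushes back on):

```python
# The "classic" SRE-book policy as a bare decision rule (illustrative only).
def classic_policy(budget_minutes, bad_minutes_this_window):
    if bad_minutes_this_window > budget_minutes:
        return "freeze features, work on reliability"
    return "ship features"

print(classic_policy(budget_minutes=43.2, bad_minutes_this_window=60.0))
# -> freeze features, work on reliability
```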

There's nothing wrong with the classic model, but you have to be careful. Dev teams usually can't stop feature work.

This causes back pressure: once the release pipeline reopens, everything ships at once and you burn through your error budget again.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Freezing the release pipeline is dangerous. Instead, talk to your team. These are nuanced decisions.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Project work focus: reliability improvements *are* features.

Not everyone owns the code they run; the classic model assumes a certain relationship among teams (e.g. OSS projects).

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Calculating and measuring better SLIs *is* project work. So is picking a better SLO threshold. Examine your measurements often; they could be very wrong.

It's not set-it-and-forget-it. Your measurements won't be perfect out the gate (or ever!).

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Examining risk factors: error budget burn lets you identify where you're not being reliable, even for short windows. This allows you to determine what your greatest risks are.

Error budget burn may or may not be a problem; it's up to you to discuss.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
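
One common way to quantify burn over short windows (my sketch; the talk doesn't prescribe a formula) is a burn rate: the observed error rate divided by the error rate the SLO allows.

```python
# Error-budget burn rate for a window: 1.0 means burning budget exactly as
# fast as the SLO allows; higher means you'd exhaust it early. Numbers are
# illustrative.
def burn_rate(bad_events, total_events, slo_percent):
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo_percent / 100
    return observed_error_rate / allowed_error_rate

# A rough hour: 0.5% errors against a 99.9% SLO is a 5x burn rate.
print(burn_rate(bad_events=50, total_events=10_000, slo_percent=99.9))  # ~5.0
```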

Experimentation and chaos engineering
- you should try to break your service! not taking it 100% down, but breaking parts of it to learn more about how it responds in reality
- but don't be too cavalier if you haven't been running reliably

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

It's not just about planning either. While performing experiments, your SLO data tells you more about the impact on service reliability.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

All changes are experiments! Changing your algorithm, garbage collection, infra. Use your SLO data as a feedback loop.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Load tests and stress tests require coordination across multiple teams. You *need* SLO data to understand how your service behaves across components. Individual metrics and logs don't answer reliability questions.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

When performing a stress test, the interesting question to answer is: "Where on the curve does the error budget get impacted?" How badly can things go before it impacts service reliability?

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
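
As a toy sketch of that question: sweep load levels, measure the SLI at each, and find the first level where it drops under the SLO target. `measure_sli_at` is a stand-in for whatever your load-test harness actually reports.

```python
# Toy sweep: at what load does the SLI fall below the SLO target?
SLO_TARGET = 0.999

def measure_sli_at(requests_per_second):
    # Stand-in for driving real load and reading real SLI data.
    fake_results = {100: 0.9999, 500: 0.9996, 1000: 0.9992, 2000: 0.9978}
    return fake_results[requests_per_second]

for rps in (100, 500, 1000, 2000):
    sli = measure_sli_at(rps)
    print(f"{rps:>5} rps -> SLI {sli:.4f}")
    if sli < SLO_TARGET:
        print(f"error budget starts taking a hit somewhere around {rps} rps")
        break
```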

Alex discusses the Google story of shutting off the Chubby service once a quarter specifically to burn through their error budget.

Important way to learn about hidden dependencies. (The scream test 😱)

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

Do nothing:
- sometimes the numbers or measurements are wrong
- sometimes you already know you have a problem and it'll take a long time to fix
- SLOs are data, not mandates. it's about having conversations

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

How do you determine reliability over time?
- incident counting?
- MTT[X] (mean time to [X])

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

If your OKR says to reduce the number of incidents:

Q1: 20 incidents
Q2: 10 incidents

Q1: 120 mins of downtime
Q2: 150 mins of downtime

That's not a better experience

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

With MTTX
- repair/remediation/recovery/response/etc.
- failure, detection, engagement...

What do we even care about? There are so many nuances that this metric ends up being meaningless.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

With MTTX

Q1: 20 incidents with 120 mins of unreliability, avg 6m each
Q2: 10 incidents with 150 mins of unreliability, avg 15m each

but Q2 was not 2.5x as bad as Q1
Simple math can lead you down the wrong path.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020
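
Running the tweet's numbers both ways makes the distortion obvious: the per-incident average jumps 2.5x while total unreliability only grows 1.25x.

```python
# The same two quarters viewed as MTTX-style averages vs. total bad minutes.
quarters = {
    "Q1": {"incidents": 20, "bad_minutes": 120},
    "Q2": {"incidents": 10, "bad_minutes": 150},
}

for name, q in quarters.items():
    avg = q["bad_minutes"] / q["incidents"]
    print(f"{name}: {q['incidents']} incidents, "
          f"{q['bad_minutes']} bad minutes total, avg {avg:.0f}m each")
# Average duration goes 6m -> 15m (2.5x), but users saw 120m -> 150m of
# unreliability (1.25x). The per-incident average exaggerates how much
# worse Q2 was.
```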

Error budgets automatically give you the best way to report reliability over time.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

The most important thing: SLO data is about having better conversations. It gives us indicators of the user's perspective.

Better conversations lead to better decisions.

- @ahidalgosre

— shelby reliability engineer (@shelbyspees) August 13, 2020

The @OReillyMedia book is at the printers, pre-order the physical copy! https://t.co/tPKLrPPIRk

— shelby reliability engineer (@shelbyspees) August 13, 2020

