Liz Fong-Jones: Refining Systems Data without Losing Fidelity
We're live! Watching @lizthegrey's talk: "Refining Systems Data without Losing Fidelity" https://t.co/RDCrPKXaLO
— shelby reliability engineer (@shelbyspees) July 22, 2020
Come watch! https://t.co/oQscGFYkiO
Complex systems are hard to manage. Things are getting increasingly complex, hence SREs.
In order to manage these complex systems, we need SLOs and o11y.
o11y tends to be expensive. We need different kinds of telemetry data. We need
- user data
- host metrics
Most problems are not per-host. You can't just say, "this host is misbehaving, I'm gonna restart it"; your k8s pod would have handled that.
- @lizthegrey
We need rich context in order to debug our complex distributed systems.
You cannot decompose your system and have any hope of understanding it. You need it all in context.
- @lizthegrey
Shout out to @emilywithcurls for all the beautiful illustrations in all of @lizthegrey's talks 🎨
As SREs, we often have to answer questions that look like "how many" or "how much"
Percentiles and distributions are ways of answering those questions.
- @lizthegrey
Have you ever played this game? (img alt: jar with marbles) pic.twitter.com/ypaS4OSKqk
If you're trying to answer "how many marbles are in the jar?" or "how many yellow marbles?" you'd have to spend all day guessing.
We can apply "reduce, reuse, recycle" in the context of SRE.
img alt: three circling arrows above the words "Reduce. Reuse. Recycle."
- @lizthegrey pic.twitter.com/EPbRdmJngK
Reducing: store less data.
A lot of what we write has duplicates or is meant to be read by humans, not computers.
- @lizthegrey
Stop writing read-never data! How long do we need our debug logs for? You can save them for e.g. 24-48 hours, don't keep them forever.
- @lizthegrey
Structure your data! It lets you query for the fields you care about.
- @lizthegrey
Log one event per transaction and contain all the related transaction context within that same event.
Use distributed tracing to connect together related events across your systems, including calling relationships.
- @lizthegrey
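A minimal sketch of what "one wide event per transaction, with trace context" can look like. All field names here are illustrative, not from the talk; the `trace.*` fields are what let distributed tracing stitch related events back together.

```python
import json

def make_event(trace_id, parent_id, span_id, **fields):
    """Emit one structured, wide event per unit of work. The trace fields
    (shared trace_id, caller's span as parent_id) encode the calling
    relationships between events across services."""
    event = {
        "trace.trace_id": trace_id,   # shared by every event in the request
        "trace.parent_id": parent_id, # the caller's span, if any
        "trace.span_id": span_id,     # this unit of work
    }
    event.update(fields)              # all the queryable context, in one place
    return json.dumps(event)

line = make_event("abc123", None, "span1",
                  service="checkout", duration_ms=212, status=200)
```

Because every field lives in one structured record, "query for the fields you care about" is just a field lookup rather than grepping prose logs.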
Sample your data! Use statistics, e.g. polling a representative sample of the public to get a sense of trends across the greater population.
Count 1/N events.
- @lizthegrey
If you're keeping 1/6 events and throwing out the other 5, you can trust that over time you'll get representative examples of the ones you threw out.
- @lizthegrey
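A sketch of 1/N sampling, under the common convention (an assumption on my part, not stated in the talk) that each kept event is tagged with its sample rate so totals can be re-inflated at query time:

```python
import random

def sample(events, n=6, seed=0):
    """Keep roughly 1 in n events; tag each kept event with its sample
    rate so counts can be estimated later (total ≈ sum of sample rates)."""
    rng = random.Random(seed)  # seeded only to make this sketch repeatable
    kept = []
    for e in events:
        if rng.randrange(n) == 0:
            kept.append({**e, "sample_rate": n})
    return kept

events = [{"id": i} for i in range(60_000)]
kept = sample(events)
# Each kept event "counts for" n events, so the estimate tracks the true total:
estimated_total = sum(e["sample_rate"] for e in kept)
```

We stored roughly a sixth of the data, but `estimated_total` lands close to the true 60,000.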
Count traces together, e.g.
- if trace_id ends with a 0, keep it
- make a decision at the head of the trace and sample the trace in full
- @lizthegrey
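The "trace_id ends with a 0" rule generalizes to deterministic head-based sampling: derive the decision from the trace_id itself, so every service that sees any span of the trace reaches the same keep/drop answer and the trace is sampled in full. This hash-based variant is my sketch of that idea:

```python
import hashlib

def keep_trace(trace_id, sample_rate=10):
    """Head-based sampling: hash the trace_id so the keep/drop decision is
    deterministic. Every host sees the same answer for the same trace, so
    traces are kept or dropped whole, never half-kept."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % sample_rate == 0

# Same trace, same answer, on every host that handles one of its spans:
assert keep_trace("req-42") == keep_trace("req-42")
```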
Don't be afraid of sample rates. 1/10,000 doesn't have to be scary.
e.g. the population size doesn't really matter, just the variance within the sample. Ask data scientists!
- @lizthegrey
inverse quantiles and SLOs are sample-safe
inverse quantile means if you pick a threshold e.g. 300 ms, what percentage of your events are below the threshold
- @lizthegrey
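A tiny sketch of an inverse quantile (values here are made up for illustration). Unlike a percentile, the ratio of under-threshold to total events is preserved in expectation under uniform sampling, which is why SLO-style questions are sample-safe:

```python
def inverse_quantile(durations_ms, threshold_ms=300):
    """Fix the threshold and ask what share of events fall at or under it,
    e.g. 'what percentage of requests finish within 300 ms?'"""
    under = sum(1 for d in durations_ms if d <= threshold_ms)
    return under / len(durations_ms)

latencies = [120, 180, 250, 290, 300, 450, 900, 95, 220, 275]
share = inverse_quantile(latencies)  # 0.8: 80% of events within 300 ms
```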
it's not perfect, but it gives us confidence. trying to be perfect is disproportionately costly and doesn't give us that much more o11y
- @lizthegrey
Aggregation destroys cardinality! You can grind up all the marbles and weigh the dust for each color.
img alt: scale weighing colored marble dust with text "This has mixed results."
- @lizthegrey pic.twitter.com/omCvSW6Wf5
It's very cheap to answer known questions! But it's inflexible for new questions.
Once the marbles are ground up, you can't go back and ask "how many blue cats-eyes were there?"
- @lizthegrey
Storing the old data in full before aggregating it is also expensive.
On top of that, temporal correlation is weak.
- @lizthegrey
ML (AIOps) is trying to make decisions based on data that's already been destroyed. It's throwing darts at the problem, not validating against known event data in full.
- @lizthegrey
When you bucket data w/ quantiles, you lose fidelity re: the original context of that data
Aggregation should be a last resort
- @lizthegrey
How do we make the cost of sampled events consistent over time? Cost scales with event volume.
Need to make sure
- cost is consistent over time
- can debug with trace data
- @lizthegrey
Adjust the sample rate based on the last minute of traffic levels. This allows us to keep a consistent number of events post-sampling to keep costs predictable and low.
- @lizthegrey
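The rate-adjustment step can be sketched in a couple of lines. The target budget and the one-minute window are illustrative parameters, not numbers from the talk:

```python
def next_sample_rate(events_last_minute, target_per_minute=1_000):
    """Dynamic sampling: derive next minute's rate from last minute's
    traffic, so the post-sampling event volume (and thus cost) stays
    near a fixed budget regardless of traffic spikes."""
    return max(1, round(events_last_minute / target_per_minute))

assert next_sample_rate(500) == 1        # low traffic: keep everything
assert next_sample_rate(50_000) == 50    # heavy traffic: keep 1 in 50
```

Either way, roughly `target_per_minute` events survive sampling each minute.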
Can also do per-key sampling. 99%+ of events are low in signal: you don't care about the fast/good as much as the slow/bad.
You don't need 99x as much fast data as slow data; you can just keep a representative sample.
- @lizthegrey
But every customer is unique. If you have a customer that's 100% down but it's so small that it's only 1% of your traffic, you might not catch that. Dashboards won't show that.
- @lizthegrey
To solve for this, we need to normalize per-key. We need to make sure it's allocated fairly across all our clients.
Different keys can have different probabilities. Downsample voluminous customers, keep the slow, the errors.
- @lizthegrey
Buffering allows us to decide which traces are interesting and keep only those, discard the rest. (This is called tail-based sampling.)
Without buffering, you have to decide at the head of the trace.
- @lizthegrey
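A toy sketch of the tail-based decision: once a trace has been buffered to completion, keep it only if it turned out to be interesting. The "interesting" predicate here (errors or >500 ms) is an illustrative assumption:

```python
def tail_sample(completed_traces,
                keep=lambda t: t["error"] or t["duration_ms"] > 500):
    """Tail-based sampling: spans are buffered until the trace completes,
    then the whole trace is kept or dropped based on what it contained."""
    return [t for t in completed_traces if keep(t)]

traces = [
    {"id": "a", "error": False, "duration_ms": 90},   # boring: dropped
    {"id": "b", "error": True,  "duration_ms": 120},  # errored: kept
    {"id": "c", "error": False, "duration_ms": 900},  # slow: kept
]
kept = tail_sample(traces)
```

Head-based sampling can't do this: at the head of the trace you don't yet know whether it will error or be slow.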
It's okay for your clients' sample rates to change over time.
- @lizthegrey
That low-volume data is so precious. That's what we want to keep and learn from.
- @lizthegrey
Exemplar: Distribution + Sampled Events
img alt: three buckets containing marble dust, each of which is labeled with the type of marble the dust was produced from
- @lizthegrey pic.twitter.com/hCaTjzTX2S
You don't need to keep each individual field of every single event, you keep properties of a representative event + the amount there was.
This is the same concept! This is sampling!
- @lizthegrey
Metrics and events can be friends 🌈
- @lizthegrey
We prevent data spew by reducing, recycling, and reusing our data.
Structure your data.
Sample your data.
If you need to, aggregate for known questions but keep the original structure so you can ask new questions as well.
- @lizthegrey
That was an awesome talk and it covered a ton of material, I'm probably gonna watch it again and write up some notes on the tail-based sampling approaches.
2020-07-22 23:01 (Last updated: 2020-09-20 08:36)