Liz Fong-Jones: Refining Systems Data without Losing Fidelity
We're live! Watching @lizthegrey's talk: "Refining Systems Data without Losing Fidelity" https://t.co/RDCrPKXaLO
— shelby reliability engineer (@shelbyspees) July 22, 2020
Come watch! https://t.co/oQscGFYkiO
Complex systems are hard to manage. Things are getting increasingly complex, hence SREs.
In order to manage these complex systems, we need SLOs and o11y.
o11y tends to be expensive. We need different kinds of telemetry data. We need
- user data
- host metrics
Most problems are not per-host. You can't just say, "this host is misbehaving, I'm gonna restart it"; your k8s pod would have handled that.
- @lizthegrey
We need rich context in order to debug our complex distributed systems.
You cannot decompose your system and have any hope of understanding it. You need it all in context.
- @lizthegrey
Shout out to @emilywithcurls for all the beautiful illustrations in all of @lizthegrey's talks 🎨
As SREs, we often have to answer questions that look like "how many" or "how much"
Percentiles and distributions are ways of answering those questions.
- @lizthegrey
Have you ever played this game? (img alt: jar with marbles) pic.twitter.com/ypaS4OSKqk
If you're trying to answer "how many marbles are in the jar?" or "how many yellow marbles?" you'd have to spend all day guessing.
We can apply "reduce, reuse, recycle" in the context of SRE.
img alt: three circling arrows above the words "Reduce. Reuse. Recycle."
- @lizthegrey pic.twitter.com/EPbRdmJngK
Reducing: store less data.
A lot of what we write has duplicates or is meant to be read by humans, not computers.
- @lizthegrey
Stop writing read-never data! How long do we need our debug logs for? You can save them for e.g. 24-48 hours, don't keep them forever.
- @lizthegrey
Structure your data! It lets you query for the fields you care about.
- @lizthegrey
Log one event per transaction and contain all the related transaction context within that same event.
Use distributed tracing to connect together related events across your systems, including calling relationships.
- @lizthegrey
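A minimal sketch of what "one wide event per transaction, with trace context" can look like. All field names here are illustrative, not from the talk; the `trace.*` fields are what let distributed tracing stitch related events back together.

```python
import json

def make_event(trace_id, parent_id, span_id, **fields):
    """Emit one structured, wide event per unit of work. The trace fields
    (shared trace_id, caller's span as parent_id) encode the calling
    relationships between events across services."""
    event = {
        "trace.trace_id": trace_id,   # shared by every event in the request
        "trace.parent_id": parent_id, # the caller's span, if any
        "trace.span_id": span_id,     # this unit of work
    }
    event.update(fields)              # all the queryable context, in one place
    return json.dumps(event)

line = make_event("abc123", None, "span1",
                  service="checkout", duration_ms=212, status=200)
```

Because every field lives in one structured record, "query for the fields you care about" is just a field lookup rather than grepping prose logs.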
Sample your data! Use statistics, e.g. polling a representative sample of the public to get a sense of trends across the greater population.
Count 1/N events.
- @lizthegrey
If you're keeping 1/6 events and throwing out the other 5, you can trust that over time you'll get representative examples of the ones you threw out.
- @lizthegrey
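A sketch of 1/N sampling, under the common convention (an assumption on my part, not stated in the talk) that each kept event is tagged with its sample rate so totals can be re-inflated at query time:

```python
import random

def sample(events, n=6, seed=0):
    """Keep roughly 1 in n events; tag each kept event with its sample
    rate so counts can be estimated later (total ≈ sum of sample rates)."""
    rng = random.Random(seed)  # seeded only to make this sketch repeatable
    kept = []
    for e in events:
        if rng.randrange(n) == 0:
            kept.append({**e, "sample_rate": n})
    return kept

events = [{"id": i} for i in range(60_000)]
kept = sample(events)
# Each kept event "counts for" n events, so the estimate tracks the true total:
estimated_total = sum(e["sample_rate"] for e in kept)
```

We stored roughly a sixth of the data, but `estimated_total` lands close to the true 60,000.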
Count traces together, e.g.
- if trace_id ends with a 0, keep it
- make a decision at the head of the trace and sample the trace in full
- @lizthegrey
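The "trace_id ends with a 0" rule generalizes to deterministic head-based sampling: derive the decision from the trace_id itself, so every service that sees any span of the trace reaches the same keep/drop answer and the trace is sampled in full. This hash-based variant is my sketch of that idea:

```python
import hashlib

def keep_trace(trace_id, sample_rate=10):
    """Head-based sampling: hash the trace_id so the keep/drop decision is
    deterministic. Every host sees the same answer for the same trace, so
    traces are kept or dropped whole, never half-kept."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % sample_rate == 0

# Same trace, same answer, on every host that handles one of its spans:
assert keep_trace("req-42") == keep_trace("req-42")
```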
Don't be afraid of sample rates. 1/10,000 doesn't have to be scary.
e.g. the population size doesn't really matter, just the variance within the sample. Ask data scientists!
- @lizthegrey
inverse quantiles and SLOs are sample-safe
inverse quantile means if you pick a threshold e.g. 300 ms, what percentage of your events are below the threshold
- @lizthegrey
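A tiny sketch of an inverse quantile (values here are made up for illustration). Unlike a percentile, the ratio of under-threshold to total events is preserved in expectation under uniform sampling, which is why SLO-style questions are sample-safe:

```python
def inverse_quantile(durations_ms, threshold_ms=300):
    """Fix the threshold and ask what share of events fall at or under it,
    e.g. 'what percentage of requests finish within 300 ms?'"""
    under = sum(1 for d in durations_ms if d <= threshold_ms)
    return under / len(durations_ms)

latencies = [120, 180, 250, 290, 300, 450, 900, 95, 220, 275]
share = inverse_quantile(latencies)  # 0.8: 80% of events within 300 ms
```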
it's not perfect, but it gives us confidence. trying to be perfect is disproportionately costly and doesn't give us that much more o11y
- @lizthegrey
Aggregation destroys cardinality! You can grind up all the marbles and weigh the dust for each color.
img alt: scale weighing colored marble dust with text "This has mixed results."
- @lizthegrey pic.twitter.com/omCvSW6Wf5
It's very cheap to answer known questions! But it's inflexible for new questions.
Once the marbles are ground up, you can't go back and ask "how many blue cats-eyes were there?"
- @lizthegrey
Storing the old data in full before aggregating it is also expensive.
On top of that, temporal correlation is weak.
- @lizthegrey
ML (AIOps) is trying to make decisions based on data that's already been destroyed. It's throwing darts at the problem, not validating against known event data in full.
- @lizthegrey
When you bucket data w/ quantiles, you lose fidelity re: the original context of that data
Aggregation should be a last resort
- @lizthegrey
How do we make the cost of sampled events consistent over time? Cost scales with event volume.
Need to make sure
- cost is consistent over time
- can debug with trace data
- @lizthegrey
Adjust the sample rate based on the last minute of traffic levels. This allows us to keep a consistent number of events post-sampling to keep costs predictable and low.
- @lizthegrey
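The rate-adjustment step can be sketched in a couple of lines. The target budget and the one-minute window are illustrative parameters, not numbers from the talk:

```python
def next_sample_rate(events_last_minute, target_per_minute=1_000):
    """Dynamic sampling: derive next minute's rate from last minute's
    traffic, so the post-sampling event volume (and thus cost) stays
    near a fixed budget regardless of traffic spikes."""
    return max(1, round(events_last_minute / target_per_minute))

assert next_sample_rate(500) == 1        # low traffic: keep everything
assert next_sample_rate(50_000) == 50    # heavy traffic: keep 1 in 50
```

Either way, roughly `target_per_minute` events survive sampling each minute.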
Can also do per-key sampling. 99%+ of events are low in signal: you don't care about the fast/good as much as the slow/bad.
You don't need 99x as much fast data as slow data; you can just keep a representative sample.
- @lizthegrey
But every customer is unique. If you have a customer that's 100% down but it's so small that it's only 1% of your traffic, you might not catch that. Dashboards won't show that.
- @lizthegrey
To solve for this, we need to normalize per-key. We need to make sure it's allocated fairly across all our clients.
Different keys can have different probabilities. Downsample voluminous customers, keep the slow, the errors.
- @lizthegrey
Buffering allows us to decide which traces are interesting and keep only those, discard the rest. (This is called tail-based sampling.)
Without buffering, you have to decide at the head of the trace.
- @lizthegrey
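A toy sketch of the tail-based decision: once a trace has been buffered to completion, keep it only if it turned out to be interesting. The "interesting" predicate here (errors or >500 ms) is an illustrative assumption:

```python
def tail_sample(completed_traces,
                keep=lambda t: t["error"] or t["duration_ms"] > 500):
    """Tail-based sampling: spans are buffered until the trace completes,
    then the whole trace is kept or dropped based on what it contained."""
    return [t for t in completed_traces if keep(t)]

traces = [
    {"id": "a", "error": False, "duration_ms": 90},   # boring: dropped
    {"id": "b", "error": True,  "duration_ms": 120},  # errored: kept
    {"id": "c", "error": False, "duration_ms": 900},  # slow: kept
]
kept = tail_sample(traces)
```

Head-based sampling can't do this: at the head of the trace you don't yet know whether it will error or be slow.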
It's okay for your clients' sample rates to change over time.
- @lizthegrey
That low-volume data is so precious. That's what we want to keep and learn from.
- @lizthegrey
Exemplar: Distribution + Sampled Events
img alt: three buckets containing marble dust, each of which is labeled with the type of marble the dust was produced from
- @lizthegrey pic.twitter.com/hCaTjzTX2S
You don't need to keep each individual field of every single event, you keep properties of a representative event + the amount there was.
This is the same concept! This is sampling!
- @lizthegrey
Metrics and events can be friends 🌈
- @lizthegrey
We prevent data spew by reducing, recycling, and reusing our data.
Structure your data.
Sample your data.
If you need to, aggregate for known questions but keep the original structure so you can ask new questions as well.
- @lizthegrey
That was an awesome talk and it covered a ton of material, I'm probably gonna watch it again and write up some notes on the tail-based sampling approaches.
2020-07-22 23:01 (Last updated: 2020-09-20 08:36)