SKILupDay Crowd Chat: What does observability in DevOps mean to you?
The ability to ask novel questions about your software systems without having to deploy new instrumentation. Same data, new questions. https://t.co/QZmFnpkTOS
— shelby reliability engineer (@shelbyspees) August 12, 2020
This requires telemetry in a data structure that can store lots of rich context, and can be easily queried later.
And it requires tooling that enables lots of exploratory querying of this data, returning results quickly so you can iterate on your novel questions.
And it requires that you instrument your systems to send all that rich context all the time. And the richest context of all comes from your code.
Instrument your code! Either with structured logging, or with SDKs that generate events from what your code is doing.
Metrics, logs, and traces as three separate things do not give you observability.
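As a rough illustration of instrumenting code to emit rich events, here is a minimal sketch using only the Python standard library. The function `handle_request`, the field names, and the `failing_checkout` example are all hypothetical; the thread does not prescribe a particular schema or SDK.

```python
import json
import time
import uuid


def handle_request(endpoint, customer_id, work):
    """Run `work` and return one wide event describing the whole request."""
    event = {
        "trace_id": uuid.uuid4().hex,   # hypothetical field names throughout
        "endpoint": endpoint,
        "customer_id": customer_id,
        "start_time": time.time(),
    }
    started = time.perf_counter()
    try:
        work()
        event["status_code"] = 200
    except Exception as exc:
        # Keep the error context on the same event instead of a separate log line.
        event["status_code"] = 500
        event["error"] = str(exc)
    event["duration_ms"] = (time.perf_counter() - started) * 1000
    return event


def failing_checkout():
    """Stand-in for application code that fails."""
    raise ValueError("card declined")


if __name__ == "__main__":
    # One JSON object per request; a real system would ship this to an event store.
    print(json.dumps(handle_request("/checkout", "acme", failing_checkout)))
```

The point of the sketch is that every field lands on a single structured event, so later questions (by endpoint, by customer, by error) need no new instrumentation.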
Instrumenting for observability can give you metrics, logs, and traces.
If your raw data is stored as rich events, you can aggregate numeric fields into metrics or string fields into logs.
And if your events have a start time, duration, and parent-child calling relationships, you can generate trace visualizations.
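To make the idea of deriving all three views from one set of raw events concrete, here is a small sketch. The event shape (`span_id`, `parent_id`, `start`, `duration_ms`, and so on) and the helper names are assumptions made for illustration, not a real tool's format.

```python
# Hypothetical raw events: one dict per unit of work.
EVENTS = [
    {"span_id": "a", "parent_id": None, "name": "GET /cart",
     "start": 0.00, "duration_ms": 120.0, "error": "db timeout"},
    {"span_id": "b", "parent_id": "a", "name": "auth",
     "start": 0.01, "duration_ms": 15.0, "error": None},
    {"span_id": "c", "parent_id": "a", "name": "db.query",
     "start": 0.03, "duration_ms": 95.0, "error": "db timeout"},
]


def as_metric(events, field):
    """Aggregate a numeric field into a count and a mean."""
    values = [e[field] for e in events]
    return {"count": len(values), "mean": sum(values) / len(values)}


def as_logs(events):
    """Pull string fields out as log-style lines."""
    return [f"{e['name']}: {e['error']}" for e in events if e["error"]]


def as_trace(events, parent_id=None, depth=0):
    """Render parent-child relationships as an indented tree, ordered by start time."""
    lines = []
    children = sorted(
        (e for e in events if e["parent_id"] == parent_id),
        key=lambda e: e["start"],
    )
    for e in children:
        lines.append(f"{'  ' * depth}{e['name']} ({e['duration_ms']:.0f} ms)")
        lines.extend(as_trace(events, e["span_id"], depth + 1))
    return lines


if __name__ == "__main__":
    print(as_metric(EVENTS, "duration_ms"))
    print("\n".join(as_logs(EVENTS)))
    print("\n".join(as_trace(EVENTS)))
```

All three functions read the same list, which is the property the thread is describing: the metric, the log lines, and the trace are views over the events rather than separately collected data.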
The observability part comes from the raw data. From your "metrics" graph of HTTP response codes, you can group by error message (logs), and then click through to a specific *trace*.
That's what observability tooling makes possible. And it requires structured raw data.
Error message is boring, though: we already know what the response code was. Let's group by endpoint. Are certain parts of the app behaving strangely?
It looks like this endpoint is having a lot of errors. Is one customer experiencing it worse? Let's group by customer.
Etc.
The three pillars can't give you that.
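The drill-down described above (group by endpoint, then by customer) might look something like this over raw events. The sample data and the `count_by` helper are hypothetical; a real observability tool would run the equivalent query against its own event store.

```python
from collections import Counter

# Hypothetical raw events, one per request.
EVENTS = [
    {"endpoint": "/export", "customer": "acme", "status_code": 500},
    {"endpoint": "/export", "customer": "acme", "status_code": 500},
    {"endpoint": "/export", "customer": "globex", "status_code": 500},
    {"endpoint": "/export", "customer": "globex", "status_code": 200},
    {"endpoint": "/home", "customer": "acme", "status_code": 200},
    {"endpoint": "/home", "customer": "initech", "status_code": 500},
]


def count_by(events, field, **filters):
    """Count events per value of `field`, keeping only events that match `filters`."""
    matching = (e for e in events if all(e[k] == v for k, v in filters.items()))
    return Counter(e[field] for e in matching)


# First question: which endpoint is producing the errors?
errors_by_endpoint = count_by(EVENTS, "endpoint", status_code=500)
worst_endpoint = errors_by_endpoint.most_common(1)[0][0]

# Follow-up question on the same data: is one customer hit harder there?
errors_by_customer = count_by(
    EVENTS, "customer", status_code=500, endpoint=worst_endpoint
)

if __name__ == "__main__":
    print(errors_by_endpoint)
    print(worst_endpoint, errors_by_customer)
```

Each follow-up is just another call with a different field or filter, which only works because the raw events still carry every dimension; pre-aggregated metrics would have discarded `customer` before the question was asked.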
2020-08-12 16:00 (Last updated: 2020-09-20 08:36)