SKILupDay Crowd Chat: What does observability in DevOps mean to you?

The ability to ask novel questions about your software systems without having to deploy new instrumentation. Same data, new questions. https://t.co/QZmFnpkTOS
— shelby reliability engineer (@shelbyspees) August 12, 2020

This requires telemetry in a data structure that can store lots of rich context, and can be easily queried later.

And it requires tooling that enables lots of exploratory querying of this data, returning results quickly so you can iterate on your novel questions.
— shelby reliability engineer (@shelbyspees) August 12, 2020

And it requires that you instrument your systems to send all that rich context all the time. And the richest context of all comes from your code.

Instrument your code! Either with structured logging, or with SDKs that generate events from what your code is doing.
— shelby reliability engineer (@shelbyspees) August 12, 2020
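As a rough illustration of the structured-logging option, here is a minimal sketch that emits one wide JSON event per request. The function name `handle_request`, the field names, and the simulated `/export` timeout are all hypothetical, chosen only for the example; a real service would send the event to whatever backend it uses instead of printing it.

```python
import json
import time
import uuid


def handle_request(endpoint, customer_id, emit=print):
    """Handle one request and emit a single wide event describing it."""
    # Start with the context known up front (all field names are assumptions).
    event = {
        "trace_id": uuid.uuid4().hex,
        "endpoint": endpoint,
        "customer_id": customer_id,
    }
    start = time.monotonic()
    try:
        # Hypothetical business logic: pretend one endpoint times out.
        if endpoint == "/export":
            raise TimeoutError("export backend timed out")
        event["status_code"] = 200
    except TimeoutError as exc:
        event["status_code"] = 504
        event["error"] = str(exc)
    finally:
        # Record the duration and emit the event whether or not we failed.
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit(json.dumps(event))
    return event


if __name__ == "__main__":
    handle_request("/export", "acme")
```

The point of the single wide event is that every field (endpoint, customer, status, error, duration) travels together, so later questions can slice on any of them.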

Metrics, logs, and traces as three separate things do not give you observability.

Instrumenting for observability can give you metrics, logs, and traces.

If your raw data is stored as rich events, you can aggregate numeric fields into metrics or string fields into logs.
— shelby reliability engineer (@shelbyspees) August 12, 2020
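To make the "events first, metrics and logs derived" idea concrete, here is a small sketch over an in-memory list of events. The sample data and the helper names (`status_counts`, `mean_duration`, `log_lines`) are made up for illustration; a real observability backend would run these aggregations at query time.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw events; each one carries numeric and string fields together.
events = [
    {"endpoint": "/cart", "status_code": 200, "duration_ms": 12.0, "error": None},
    {"endpoint": "/cart", "status_code": 500, "duration_ms": 250.0, "error": "db connection refused"},
    {"endpoint": "/export", "status_code": 504, "duration_ms": 3000.0, "error": "export backend timed out"},
    {"endpoint": "/home", "status_code": 200, "duration_ms": 8.0, "error": None},
]


def status_counts(events):
    """A 'metric': how many events had each HTTP status code."""
    return Counter(e["status_code"] for e in events)


def mean_duration(events):
    """Another 'metric': the average of a numeric field."""
    return mean(e["duration_ms"] for e in events)


def log_lines(events):
    """'Logs': the string fields of the events that recorded an error."""
    return [f'{e["status_code"]} {e["endpoint"]} {e["error"]}' for e in events if e["error"]]


if __name__ == "__main__":
    print(status_counts(events))
    print(mean_duration(events))
    print("\n".join(log_lines(events)))
```

Nothing here needed a separate metrics pipeline or log pipeline; both views fall out of the same raw events.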

And if your events have a start time, duration, and parent-child calling relationships, you can generate trace visualizations.
— shelby reliability engineer (@shelbyspees) August 12, 2020
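A sketch of how those three fields are enough to draw a trace: the code below rebuilds the call tree from `parent_id` links and prints an indented view. The span data and the names `id`, `parent_id`, `start_ms`, and `duration_ms` are assumptions for this example, not any particular vendor's schema.

```python
from collections import defaultdict

# Hypothetical events that each carry a start time, duration, and parent link.
spans = [
    {"id": "a", "parent_id": None, "name": "GET /cart", "start_ms": 0, "duration_ms": 120},
    {"id": "c", "parent_id": "a", "name": "db.query", "start_ms": 30, "duration_ms": 80},
    {"id": "b", "parent_id": "a", "name": "auth.check", "start_ms": 5, "duration_ms": 20},
    {"id": "d", "parent_id": "c", "name": "db.connect", "start_ms": 32, "duration_ms": 10},
]


def render_trace(spans):
    """Return indented text lines showing the parent-child call tree."""
    # Index spans by their parent so each node's children can be looked up.
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit children in start-time order, indenting one level per depth.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            indent = "  " * depth
            lines.append(f'{indent}{span["name"]} (+{span["start_ms"]}ms, {span["duration_ms"]}ms)')
            walk(span["id"], depth + 1)

    walk(None, 0)  # Root spans are the ones with no parent.
    return lines


if __name__ == "__main__":
    print("\n".join(render_trace(spans)))
```

A graphical waterfall view is the same tree with bars positioned by start time and sized by duration.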

The observability part comes from the raw data. From your "metrics" graph of HTTP response codes, you can group by error message (logs), and then click through to a specific *trace*.

That's what observability tooling makes possible. And it requires structured raw data.
— shelby reliability engineer (@shelbyspees) August 12, 2020

Error message is boring though; we already know what the response code was. Let's group by endpoint. Are certain parts of the app behaving strangely?

It looks like this endpoint is having a lot of errors. Is one customer experiencing it worse? Let's group by customer.

Etc.
— shelby reliability engineer (@shelbyspees) August 12, 2020
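The drill-down described in this tweet can be sketched as repeated group-bys over the same raw events, each one narrowing the previous question. The sample events and the helper `errors_by` are hypothetical; treating status codes of 500 and above as errors is also an assumption made for the example.

```python
from collections import Counter

# Hypothetical raw events with enough context to slice in several directions.
events = [
    {"endpoint": "/cart", "customer_id": "acme", "status_code": 500},
    {"endpoint": "/cart", "customer_id": "acme", "status_code": 500},
    {"endpoint": "/cart", "customer_id": "globex", "status_code": 500},
    {"endpoint": "/home", "customer_id": "acme", "status_code": 200},
    {"endpoint": "/export", "customer_id": "globex", "status_code": 504},
    {"endpoint": "/cart", "customer_id": "initech", "status_code": 200},
]


def errors_by(events, field, **filters):
    """Count error events grouped by `field`, optionally filtered on other fields."""
    matching = (
        e for e in events
        if e["status_code"] >= 500  # Assumption: 5xx responses count as errors.
        and all(e.get(key) == value for key, value in filters.items())
    )
    return Counter(e[field] for e in matching)


if __name__ == "__main__":
    # Question 1: which endpoints are producing errors?
    print(errors_by(events, "endpoint"))
    # Question 2: on the worst endpoint, is one customer hit harder?
    print(errors_by(events, "customer_id", endpoint="/cart"))
```

Neither question required new instrumentation. The second query was only possible because `customer_id` was already on every event.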

The three pillars can't give you that.
— shelby reliability engineer (@shelbyspees) August 12, 2020