https://ericmustin.substack.com/p/notes-on-an-observability-team

I think the concept of an Observability Team within tech companies is relatively new. Or, maybe it's been around for a while, but as a part-time focus of a broad Production Engineering or SRE group, not as a dedicated role. So, while SRE and DevOps type job explainers have been written ad nauseam, I found there's relatively little online about Observability Teams and roles. I figured I'd share a bit about my experience on an O11y Team. Be the change and all that.

Inquiring minds want to know!


Here's some context. I've been working at a BigCo type of place this past year. We emit many terabytes of Observability data a day across many formats, not unlike other BigCo type places. This data is ingested, stored, and transported via a mix of vendor, OSS, and home-grown software. I'm not particularly senior or influential; I'm just a humble IC, hacking away in the code mines, working with users, etc. Before this, I worked at an Observability vendor that sold software to folks trying to manage their Observability data. In my free time, like a real Poindexter, I like to maintain and contribute to OSS Observability projects. With that in mind, take my thoughts as what they are, lived experience, which may or may not apply directly to your organization and use case. I am not the cosmos.

So anyway, what's the point of an Observability Team? Well, first off, that would depend on what we mean by capital-O Observability. Observability, while certainly popularized by Charity Majors, is most succinctly described by Brendan Gregg's excellent blog post. Here's my favorite line:

Let me try some observability first. (Means: Let me look at the system without changing it.)

Ok, so in theory Observability is being able to derive insights about your software systems without having to go back after the fact and change the system. In practice, this looks a lot like collecting Monitoring data, but data that you can query flexibly and creatively at read time.

The point of an Observability Team is to help deliver that aforementioned Observability to all the other engineers at the company, in a curated, color-inside-the-lines way. It's Bark Box for Observability tooling, with two priorities in mind: Enablement and Cost Control. I'll explain these priorities in a minute, but that's pretty much the gist of it.

Not every company is going to need an Observability Team. These priorities might be something a few folks, or even just one person, can deliver part time, with a few high-leverage SaaS investments, a few spare cycles, and a culture of Observability within the engineering department. But at a certain scale I think it becomes a reasonable investment or full-time focus.

It's also an easily misunderstood team within a company. Does creating an Observability team, or giving certain folks a mandate to focus on Observability, send the implicit message that others shouldn't care? I hope not, but I think that can happen. Observability is something every software engineer should practice, whether that means putting thought into what they log, what infra metrics and SLOs they monitor, what they trace and instrument, what they profile, and so on.

An Observability team should be a complement to, not a replacement for, a strong Observability culture within an engineering team. This means that an Observability team shouldn't be off in a corner, trying to abstract away all the Observability work. The goal shouldn't be a pixel perfect, all-in-one dashboard for every application and service that magically summarizes all relevant data points into a few Top 10 lists, and answers every question an engineer could have, before they even know to ask it. The goal shouldn't be that the folks working on the domain specific stuff your company is known for, Payments or E-commerce or NFTs or What-Have-You, don't ever need to think about using Observability tools. An Observability Team can't just go out and buy every feature from whatever vendor is in The Top Right Corner of the Gartner Magic Quadrant, and shout "Done!".

The Observability Platform every Vendor wants you to buy


Instead, the goal should be that the benefits of Observability get proselytized within your organization. So while the high level bits that can be summarized via pre-canned views are useful and worth delivering out of the box (think: Golden Signals, RED/USE Metrics, Error Tracking, Apdex, etc), what's really important is that the ad-hoc, domain specific, exploratory tooling is as easy as possible to use. Delightful, even.

Besides the work required to build, buy, and configure these tools, an Observability team needs to teach users of these tools "how to fish" to some extent. So, a user wants to see the p95 latency of a particular subset of requests to a particular route within their web app? The Observability Team would help them construct the necessary query, and then afterward they'd add that query to documentation, annotating it with explanations of the query syntax, so others could find it. These sorts of collaborative moments, alongside building good documentation, playbooks, FAQs, tutorials, and so on, often end up being a large chunk of the enablement work an Observability team does. An Observability team should do a lot of Evangelism, even if it means wearing stupid hats and shirts.
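To make that "help them construct the query" step concrete, here's a minimal sketch of what the finished artifact might look like, assuming metrics live in a Prometheus-compatible backend and the service exposes a latency histogram. The endpoint, metric name, route, and label values are placeholders I've made up for illustration, not anyone's actual conventions.

```python
# Sketch: fetch p95 latency for one route from a Prometheus-compatible backend.
# PROM_URL, the metric name, and the label values are illustrative assumptions.
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical internal endpoint

# histogram_quantile() estimates a quantile from the _bucket series of a
# histogram; the label matchers narrow it down to one route and method.
query = (
    'histogram_quantile(0.95, '
    'sum(rate(http_request_duration_seconds_bucket'
    '{route="/checkout", method="POST"}[5m])) by (le))'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    _ts, value = result["value"]
    print(f"p95 latency: {float(value):.3f}s")
```

The annotated version of this that lands in the docs is where the teaching happens: explaining the rate() over histogram buckets, the sum by (le), and how histogram_quantile() turns that into an estimated p95.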

Another part of enablement is instrumentation, SDKs, and data collection. These all should be as easy to use as the UI. There should be well defined standards and semantics, along with vetted and blessed instrumentation libraries. What can be preconfigured should be already configured for the user, ideally via easy-to-adjust environment variables. Any additional infrastructure that needs to be deployed with the application, like a sidecar agent, should be standardized and included in OOTB templates or manifests. The goal is that all a user needs to focus on is their code: what to instrument and what metadata they'd like to collect. And when it's not easy, or when it's kludgy, or when it involves some unholy private method monkey patch, an Observability Team should be recording these use cases that force users to color outside the lines, and determine ways to make it easier going forward. Maybe that's building internal tooling (helper libraries, mix-ins, custom pipeline processors), or buying point solutions (A user wants to have Browser Session Replay? Maybe buying Replay.io makes more sense than trying to ship your own custom browser or browser extension).
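As a sketch of what "blessed, preconfigured instrumentation" can look like, here's roughly how a standard tracing setup might be packaged with the OpenTelemetry Python SDK. The service name, attribute keys, and the charge_card function are placeholders; the exporter endpoint and sampling are deliberately left to the standard OTEL_* environment variables, so the only decisions left to application authors are what to instrument and what metadata to attach.

```python
# Sketch of a "blessed" tracing setup using the OpenTelemetry Python SDK.
# Exporter endpoint and sampling come from the standard OTEL_* environment
# variables (e.g. OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER), so this
# module can be shipped as-is in a shared template.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes follow OpenTelemetry semantic conventions; the values
# here are placeholders for whatever the Observability Team standardizes on.
provider = TracerProvider(
    resource=Resource.create(
        {"service.name": "checkout-service", "deployment.environment": "production"}
    )
)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def charge_card(order_id: str, amount_cents: int) -> None:
    # The part the application author actually owns: span names and attributes.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        ...  # domain logic goes here
```

In practice a shared helper library or base image would hide the provider boilerplate entirely, leaving only the start_as_current_span calls in application code.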

This is also where Cost control, the other responsibility of an Observability Team, starts to matter. Simply buying (or building) a tool when someone asks for it is an anti-pattern. You'll just end up with a bunch of half baked, poorly maintained Observability tools instead of a cohesive platform, with tons of tribal knowledge, few standards, and no source of truth. Instead, an Observability Team needs to be able to work with the user to understand the problem they're trying to solve. It's possible that existing tools can already solve that problem using a different pattern. And if they can't, it's important to be able to recognize when solving a use case has a relatively low return on investment. Another point solution, or an increased infrastructure footprint for a custom solution, may not be beneficial enough to the entire organization to warrant the investment.

Cost control also matters when curating the Observability experience for other engineers at a company. Curating for cost control can mean enforcing limits on the volume and verbosity of Observability data. It can mean applying sampling algorithms, or defining different retention standards, at different levels of granularity, for different types of Observability Data. It can mean metering individual teams' usage over time to allocate costs, instead of treating all Observability data as one specific cost center. It can mean making sure that an Error Stack Trace isn't getting simultaneously logged, added as metadata to a Span Attribute, and shipped to a 3rd party Error Tracking SaaS, each of which adds costs, especially if only one of those three places is actually being used by other engineers to look at Error Stack Traces.
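To give a flavor of what those sampling and retention knobs might look like when expressed as configuration, here's a minimal sketch of a per-team, per-signal policy with deterministic head sampling. The team names, keep ratios, and retention windows are invented for illustration; real pipelines (OpenTelemetry Collector processors, vendor ingestion controls, and so on) expose their own versions of these controls.

```python
# Sketch: per-team, per-signal volume controls expressed as data that a
# collection pipeline could enforce. All values here are made up.
import hashlib

SAMPLING_POLICY = {
    # (team, signal) -> fraction of data to keep, and how long to retain it
    ("payments", "traces"): {"keep_ratio": 0.20, "retention_days": 30},
    ("payments", "debug_logs"): {"keep_ratio": 0.01, "retention_days": 3},
    ("storefront", "traces"): {"keep_ratio": 0.05, "retention_days": 14},
}
DEFAULT_POLICY = {"keep_ratio": 0.10, "retention_days": 7}


def should_keep(team: str, signal: str, trace_id: str) -> bool:
    """Deterministic head sampling: hash the trace id so every span in a trace
    gets the same keep/drop decision, then compare against the team's ratio."""
    policy = SAMPLING_POLICY.get((team, signal), DEFAULT_POLICY)
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    return bucket < policy["keep_ratio"]


# Example: decide whether to keep one payments trace.
print(should_keep("payments", "traces", "4bf92f3577b34da6a3ce929d0e0e4736"))
```

Expressing the policy as data also helps with the metering point above: the same table that decides what gets kept can be used to attribute volume, and therefore cost, back to each team.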