Commentary: Those who say observability killed monitoring aren’t paying attention. Here’s why.
You can be forgiven if you thought monitoring was passé. Nagios, for example, is probably the best known of the open source monitoring tools, but interest in it has steadily declined for over a decade. Meanwhile, observability tools like OpenTelemetry are hot, though “observability” is arguably a cool new term for much the same metrics, logs, and traces that we had been analyzing long before the term was coined.
Indeed, as Lightstep CEO Ben Sigelman has argued, observability isn’t going to replace monitoring “because it shouldn’t.” Observability is all about augmenting monitoring, not replacing it. Here’s why.
Thinking differently about monitoring
I suggested above that observability is really just a fancy way of saying “logs, traces, and metrics,” but that’s overly simplistic. Ultimately, according to Sigelman, observability is about telemetry and storage. Telemetry is increasingly synonymous with OpenTelemetry, the CNCF-hosted open source project. And storage? It’s more than a time series database for metrics or a database for storing logs, traces, and transactions; you need both.
The third thing Sigelman insists upon brings us back to monitoring: tracking the health of the system (i.e., monitoring) and understanding change within those systems (i.e., the statistical insights buried in all that telemetry data). Sounds important, right? That’s because it is. As Sigelman went on to explain, monitoring really means “an effort to connect the health of a system component to the health of the business.” That’s always going to be a good idea, and it feeds into more modern refinements of service-level agreements (SLAs), such as service-level objectives (SLOs), an approach Google has helped to popularize.
So why is monitoring suddenly un-cool? Sigelman suggested:
“Monitoring” got a bad name because operators were trying to monitor every possible failure mode of a distributed system. That doesn’t work because there are too many of them. (And that’s why you have too many dashboards at your company.)
Dashboards are nice, but they can confuse as much as they clarify by bombarding operators with too much data. Sigelman argued that SLOs can help monitoring evolve beyond noisy dashboards toward targets that let us gauge changes in the signals that actually track system health.
These SLOs, which set a numerical target for system availability, act as the “peripheral nervous system” of observability, said Sigelman. Rather than relying on a human staring at a dashboard (or, more likely, an array of dashboards) and hoping she can cognitively decipher what’s happening at a glance, the SLO approach instruments systems in a way that lets humans dial reliability up (at greater operating cost) or down (lowering costs and increasing development velocity), both before and after the fact.
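To make that concrete, here is a minimal sketch of the error-budget arithmetic behind an availability SLO. The function names and the 99.9%/99.99% targets are illustrative assumptions, not anything from Sigelman or a specific tool; the point is that the "dial" is just a number with real downtime consequences.

```python
# Sketch: error-budget math for a hypothetical availability SLO.
# All names and targets here are illustrative, not from any vendor's API.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime implied by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

# Dialing reliability "up" shrinks the budget; dialing it "down" grows it.
print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))   # 4.32 minutes per 30 days
print(round(budget_remaining(0.999, 10.0), 3))  # 0.769 of the budget left
```

An alert on the rate at which this budget is being spent replaces the wall of dashboards: one signal, tied directly to the availability promise the business actually made.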
So is monitoring dead? Nope. Not even close. Perhaps the way we used to conceive of monitoring is due to be retired, but the practice of monitoring has never been more important. It’s a central component of observability, and likely will be for years to come.
Disclosure: I work for AWS, but the views expressed herein are mine.