Sri Chaganty

“Observability” has become a key trend in Site Reliability Engineering (SRE) practice. One of the recommendations in Gartner’s Market Guide for IT Infrastructure Monitoring Tools, released in January 2020, says, “Contextualize data that ITIM tools collect from highly modular IT architectures by using AIOps to manage other sources, such as observability metrics from cloud-native monitoring tools.”

Like so many other terms in software engineering, ‘observability’ is a term borrowed from an older physical discipline: in this case, control systems engineering. Let me use the definition of observability from control theory in Wikipedia: “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

Observability is gaining attention in the software world because it helps engineers deliver excellent customer experiences despite the complexity of the modern digital enterprise.

When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: a single request now hops across the network from service to service. Monitoring tools are still coming to grips with this seismic shift.

How is observability different from monitoring?

Monitoring requires you to know what you care about before you know you care about it. Observability allows you to understand your entire system and how it fits together, and then use that information to discover what specifically you should care about when it’s most important.

Monitoring requires you to already know what normal is. Observability allows discovery of different types of ‘normal’ by looking at how the system behaves, over time, in different circumstances.

Monitoring asks the same questions over and over again. Is CPU usage under 80%? Is memory usage under 75%? Is latency under 500ms? This is valuable information, but monitoring is only useful for problems you already know about.
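To make the "same questions over and over" point concrete, here is a minimal Python sketch of threshold-style monitoring. The metric names (cpu_percent, memory_percent, latency_ms), the Threshold class, and the breached function are hypothetical names chosen for illustration; only the three limits come from the questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, not tied to any real tool
    limit: float  # the question is "is the value under this limit?"


# The three fixed questions from the text, encoded ahead of time.
THRESHOLDS = [
    Threshold("cpu_percent", 80.0),
    Threshold("memory_percent", 75.0),
    Threshold("latency_ms", 500.0),
]


def breached(sample: dict[str, float]) -> list[str]:
    """Return the metrics in `sample` that are NOT under their limit."""
    return [
        t.metric
        for t in THRESHOLDS
        if sample.get(t.metric, 0.0) >= t.limit
    ]


# Every sample is checked against the same predefined questions.
sample = {"cpu_percent": 91.0, "memory_percent": 60.0, "latency_ms": 120.0}
print(breached(sample))
```

Anything not listed in THRESHOLDS is invisible to this check, which is the limitation the paragraph describes: the questions have to be known in advance.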

Observability, on the other hand, is about asking different questions almost all the time. You discover new things.

Metrics do not equal observability.

What Questions Can Observability Answer?

Below are sample questions that can be addressed by an effective observability solution:

  • Why is x broken?
  • What services does my service depend on — and what services are dependent on my service?
  • Why has performance degraded over the past quarter?
  • What changed? Why?
  • What logs should we look at right now?
  • What is system performance like for our most important customers?
  • What SLO should we set?
  • Are we out of SLO?
  • What did my service look like at time point x?
  • What was the relationship between my service and x at time point y?
  • What was the relationship of attributes across the system before we deployed? What’s it like now?
  • What is most likely contributing to latency right now? What is most likely not?
  • Are these performance optimizations on the critical path?
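As one illustration, the dependency question in the list above ("what services does my service depend on, and what services are dependent on my service?") can be sketched in a few lines of Python. This assumes caller/callee pairs have already been extracted from trace data; the service names and the helper functions build_graphs and reachable are hypothetical, not part of any particular observability product.

```python
from collections import defaultdict

# Hypothetical (caller, callee) pairs, as might be derived from trace spans.
CALLS = [
    ("web", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "ledger"),
]


def build_graphs(calls):
    """Index the call pairs in both directions."""
    downstream = defaultdict(set)  # service -> services it calls
    upstream = defaultdict(set)    # service -> services that call it
    for caller, callee in calls:
        downstream[caller].add(callee)
        upstream[callee].add(caller)
    return downstream, upstream


def reachable(graph, start):
    """Return every service reachable from `start`, directly or transitively."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


downstream, upstream = build_graphs(CALLS)
print("checkout depends on:", sorted(reachable(downstream, "checkout")))
print("depends on checkout:", sorted(reachable(upstream, "checkout")))
```

The same indexed data can be queried for any service without deciding in advance which one matters, which is the kind of open-ended question the list is pointing at.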

About the Author –

Sri is a Serial Entrepreneur with over 30 years’ experience delivering creative, client-centric, value-driven solutions for bootstrapped and venture-backed startups.