The Rise of Observability (o11y): From Neanderthals with Logs to Astronauts with Instrument Panels

Published: Jun 22, 2022
By: Victor Adossi, EIR

TL;DR: Observability (“o11y”) has undergone a nearly four-decade transition: from Apache/NGINX logs, to Nagios, to Zipkin/Jaeger/Prometheus, and finally to self-serve platforms powered by Kubernetes.

In the beginning, there were logs

Ignoring the punch card era and a few intermediate technologies, the invention of the teletype brought developers their most comfortable debugging technique – logging.

Logs are arguably the first pillar of observability – for a long time (and even now), the easiest and often best way to check the health of an application was a quick look at, or programmatic scan of, Apache, NGINX, or application logs.

Application level logs (and tools like logrotate) helped people investigate issues and debug just as they do today.
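
As a refresher, here is a minimal sketch of plain application-level logging in Go – the file path and log fields are purely illustrative, and most modern apps simply log to stdout and let the platform handle files:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Append to a plain log file that a tool like logrotate could rotate.
	// The path is illustrative.
	f, err := os.OpenFile("/var/log/myapp/app.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatalf("open log file: %v", err)
	}
	defer f.Close()

	logger := log.New(f, "myapp ", log.LstdFlags|log.LUTC)
	logger.Printf("request handled path=%s status=%d duration_ms=%d", "/checkout", 200, 37)
}
```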

Post Y2K, metrics with Nagios

Logs are great, but sometimes being able to track metrics is better.

Underlying infrastructure like routers, switches, and hard drives are easier to monitor with access to discrete measurements that represent usage, contention, and other constraints.

After working on keeping Novell NetWare servers up, Ethan Galstad wrote NetSaint, later renamed Nagios. The history of Nagios is fascinating, but more importantly it captures a shift in the evolution of observability – people started to rely widely on pull-based checks and metrics, with pluggability built in so that many different reusable checks could be performed.

Nagios in 2006

Nagios network overview page

While Nagios seems quaint now, it represented a new era for observability, and was a vital tool for many system administrators.
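
Part of why Nagios checks were so reusable is a very simple plugin contract: a check prints one status line and reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Below is a minimal sketch of such a check in Go – the endpoint and threshold are illustrative:

```go
// A minimal sketch of a Nagios-style check plugin: print one status line
// and communicate state via exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	const target = "http://localhost:8080/health" // hypothetical endpoint
	client := &http.Client{Timeout: 5 * time.Second}

	start := time.Now()
	resp, err := client.Get(target)
	elapsed := time.Since(start)

	switch {
	case err != nil:
		fmt.Printf("CRITICAL - %s unreachable: %v\n", target, err)
		os.Exit(2)
	case resp.StatusCode >= 500:
		fmt.Printf("CRITICAL - %s returned %d\n", target, resp.StatusCode)
		os.Exit(2)
	case elapsed > 2*time.Second:
		fmt.Printf("WARNING - %s slow: %s\n", target, elapsed)
		os.Exit(1)
	default:
		fmt.Printf("OK - %s returned %d in %s\n", target, resp.StatusCode, elapsed)
		os.Exit(0)
	}
}
```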

In 2008, New Relic makes things easier with the power of SaaS

Maintaining large metrics ingestion, analysis, and graphing systems is hard.

Like any good Software as a Service (“SaaS”) project, New Relic made it easy for companies to focus on their core competencies and reduce the maintenance burden for monitoring.

Seeing pervasive use in the Ruby on Rails and Heroku developer communities, New Relic quickly became the gold standard for entry-level and mid-tier developer observability.

New Relic made it easy for developers to focus on the “golden signals” that correlate most strongly with app health and user experience.

NewRelic.com in 2008

An Aside: Which metrics are “golden”?

These days we commonly call them “golden signals”, but New Relic was innovative in surfacing only the most important metrics, pushing users to conserve attention for the groupings of metrics that really matter.

One grouping is called the RED method:

  • (Request) Rate - How many requests per second are being performed?
  • # of Errors - How many errors are occurring?
  • (Request) Duration - How long do requests take?

These days we also have the USE method, popularized by Brendan Gregg:

  • Utilization - How much is the resource being used?
  • Saturation - Does this resource have extra work it can’t perform that is piling up?
  • Errors - How many errors are occurring?

In general, RED is a little better suited to applications and USE is better suited to lower-level components and underlying infrastructure. Note that regardless of what you’re running, errors are quite important to track.
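
To make RED concrete, here is a minimal sketch of tracking all three signals with the Prometheus Go client (github.com/prometheus/client_golang) – the metric names and the simulated workload are illustrative:

```go
package main

import (
	"errors"
	"math/rand"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Rate: Prometheus derives requests-per-second from this monotonically
	// increasing counter.
	opsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "checkout_ops_total",
		Help: "Total checkout operations.",
	})
	// Errors: how many operations failed.
	opsErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "checkout_errors_total",
		Help: "Total failed checkout operations.",
	})
	// Duration: how long operations take, as a histogram in seconds.
	opsDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "checkout_duration_seconds",
		Help:    "Checkout latency.",
		Buckets: prometheus.DefBuckets,
	})
)

// checkout is a stand-in for real work; it records all three RED signals.
func checkout() error {
	start := time.Now()
	defer func() { opsDuration.Observe(time.Since(start).Seconds()) }()

	opsTotal.Inc()
	if rand.Intn(10) == 0 { // simulate an occasional failure
		opsErrors.Inc()
		return errors.New("payment gateway timeout")
	}
	return nil
}

func main() {
	prometheus.MustRegister(opsTotal, opsErrors, opsDuration)
	for i := 0; i < 100; i++ {
		_ = checkout()
	}
}
```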

2010: Tracing, the rise of Dapper, Zipkin and Jaeger

Logs and metrics are great, but for tracking down defects, knowing the precise flow of an operation or request is a game changer.

Google was among the first to openly discuss its approach to tracing, publishing a paper on an internal system called Dapper in 2010. Zipkin was developed at Twitter in 2012, and it influenced the development of Jaeger at Uber in 2015.

Thus began the era of tracing for observability, with Dapper, Zipkin, and Jaeger.

Adopting tracing in a codebase introduced people to the fundamental tension of observability and instrumentation. Much of the time, good observability into a system isn’t free – systems must be built to be observable.
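
That tension is easiest to see in code: spans have to be created and propagated by the application itself. Here is a minimal sketch of manual instrumentation using today's OpenTelemetry Go API (a descendant of the client libraries these systems popularized, and one that can export to Jaeger or Zipkin backends) – the names are illustrative and exporter setup is omitted:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func handleOrder(ctx context.Context, orderID string) error {
	// Start a span; the returned context carries it to downstream calls,
	// which is how the end-to-end request flow is stitched together.
	ctx, span := otel.Tracer("shop").Start(ctx, "handleOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	return chargeCard(ctx, orderID)
}

func chargeCard(ctx context.Context, orderID string) error {
	// A child span: it is automatically linked to handleOrder via ctx.
	_, span := otel.Tracer("shop").Start(ctx, "chargeCard")
	defer span.End()

	// ... call the payment provider here ...
	return nil
}

func main() {
	_ = handleOrder(context.Background(), "order-123")
}
```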

2012: The Prometheus revolution

At this point Nagios had lived long enough to become the villain, and Prometheus, begun at SoundCloud in 2012, arrived as a simpler, more scalable, and more efficient alternative.

Prometheus reintroduced the world to pull-based metrics, introduced the Prometheus exposition format (the basis of the OpenMetrics standard), and became a widely adopted piece of the cloud native landscape.
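
In the pull model, the application simply exposes an HTTP endpoint (conventionally /metrics) in the text exposition format, and the Prometheus server scrapes it on an interval. A minimal sketch, continuing the Go example above – the port and sample output are illustrative:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() renders every registered metric in the text
	// exposition format, producing output along the lines of:
	//
	//   # HELP checkout_ops_total Total checkout operations.
	//   # TYPE checkout_ops_total counter
	//   checkout_ops_total 100
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```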

2014: A single pane of glass, with Grafana

Logs, metrics and traces are useless without an ergonomic looking glass (no, not that Looking Glass).

Grafana appeared in 2014 to save companies from building ad hoc visualizations or hand-rolling gnuplot charts and SVGs. Observability data finally became much easier to observe.

Until Grafana came onto the scene and offered an open source solution for the most common visualizations and charts, many companies built bespoke dashboards.

Metrics, logs, traces and more – Datadog turns it up to 11

With all the good tools out there, why build another one?

Datadog is one of the leaders in the observability space because it managed to centralize observability data (metrics, logs, and traces) and made it nearly effortless to collect, analyze and display information.

Logs, metrics, traces, user session data and more can all be fed into Datadog via client libraries and explored through responsive web-based analysis and alerting.
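
Under the hood, much of that ingestion is refreshingly simple. As a rough sketch (in practice you would use an official client library), here is what pushing a single counter increment to a local Datadog agent over the DogStatsD UDP protocol can look like – the metric name and tag are illustrative:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The Datadog agent listens for DogStatsD datagrams on UDP 8125 by default.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	// Format: metric.name:value|type|#tag:value  (here, a counter increment).
	fmt.Fprint(conn, "checkout.completed:1|c|#env:prod")
}
```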

Companies like Honeycomb, LightStep and others are also blazing their own trails, and today teams are spoiled for choice on innovative observability platforms.

What would once have taken an experienced team years to put in place can now be set up in minutes with these SaaS solutions.

The unevenly distributed future: Observability “at home” with Kubernetes

Entrusting your observability needs to vendors brings ease and efficiency, but many high-performing organizations need observability at home in their own data centers, often as a self-service environment where developers are empowered to act autonomously.

As containers did their part in eating the world, Kubernetes became the leader in container orchestration software.

As the saying goes, the future is already here – it’s just unevenly distributed.

While many companies entrust their data to Honeycomb or Datadog, a growing number of companies choose to run their own observability platforms internally.

Running Prometheus, Zipkin/Jaeger, Grafana, Elasticsearch, and other observability tools per team and across internal organizations is a reality for some teams today but not all, and the trend is growing.

Easing the platform engineering burden

That said, managing Kubernetes or any other robust container orchestration platform is not for the faint of heart. Considerable expertise is required to maintain robust and resilient platforms.

Despite all the value that tools like Sysdig provide, integrating them with on-premises systems can be difficult and requires careful planning.

For service providers, companies like Replicated have popped up to make it easy for businesses to deploy production-grade observability tooling “at home” in their data centers.

The dominance of the Operator pattern - CRDs and Controllers

One of the key innovations in Kubernetes is the Operator pattern. The building blocks are native to Kubernetes, but CoreOS was the first to coin the term, and the rest is history.

The Custom Resource Definition (“CRD”) enables Kubernetes to be extended almost infinitely – covering new use cases and integrations, and reducing toil for cluster operators.

The operator pattern is simple, yet extremely effective:

  • CRDs define the resources/objects of interest
  • Controllers watch the cluster and the relevant custom resources, enforcing the desired state and reacting to changes

This is the key to the self-healing ability of Kubernetes and other advanced systems – and the key to massive scale without proportional toil for platform engineers providing observability.
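
To illustrate the idea, here is a deliberately simplified, conceptual sketch of a reconcile loop. Real controllers are usually built with controller-runtime and are driven by watch events rather than a timer, and every type and function below is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState is what a custom resource (CRD instance) declares, e.g.
// "this team should have a Prometheus with 2 replicas".
type DesiredState struct {
	Team     string
	Replicas int
}

// ObservedState is what actually exists in the cluster right now.
type ObservedState struct {
	Replicas int
}

func fetchDesired() DesiredState   { return DesiredState{Team: "payments", Replicas: 2} }
func fetchObserved() ObservedState { return ObservedState{Replicas: 1} }

// reconcile compares desired and observed state and acts on the difference.
func reconcile() {
	desired := fetchDesired()
	observed := fetchObserved()

	if observed.Replicas != desired.Replicas {
		// A real controller would create/update Deployments, ConfigMaps,
		// etc. here; this sketch just reports the drift.
		fmt.Printf("team %s: scaling from %d to %d replicas\n",
			desired.Team, observed.Replicas, desired.Replicas)
	}
}

func main() {
	// Self-healing comes from continually converging the world toward the
	// declared state.
	for i := 0; i < 3; i++ { // bounded here so the sketch terminates
		reconcile()
		time.Sleep(time.Second)
	}
}
```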

What’s next?

The story of observability is far from over, so here at OCV we’re asking ourselves the age-old question – “What’s next?”

Pain-free observability

Projects like Pixie are pushing the boundaries of what adding observability can look like. One of the biggest benefits of tools like Pixie is how little setup they require.

Since OpenCensus merged with OpenTracing to produce OpenTelemetry, the wide range of SDKs and off-the-shelf components offered by OpenTelemetry has made it drastically easier to add observability to any project.
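
For example, wrapping an HTTP handler with OpenTelemetry's otelhttp middleware is often all it takes to get a span for every request – a minimal sketch, with exporter setup omitted and the handler purely illustrative:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})

	// One wrapper call buys request spans, status codes, and timing.
	http.Handle("/hello", otelhttp.NewHandler(hello, "hello"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```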

Will a day come where observability for most languages is completely automatic?

High-performance observability

eBPF has been a transformative addition to the Linux kernel, enabling observability of low-level kernel constructs and operations.

With eBPF, administrators can collect metrics from system calls, network devices, and drivers at a speed and granularity that was previously hard or impossible to achieve. In the Kubernetes space, Cilium is a networking stack built on the power of eBPF.

Correlation across observability tools

Correlation separates data from noise. As we build and use more and more tools to be better informed about our systems, correlating what we find and measure becomes of utmost importance.

Lots of teams are working on this problem.

Observability is a flywheel that is still spinning – higher-level analysis across tooling, services, and platforms is yet to be built.

Anomaly detection (with or without ML/AI)

Every step forward in anomaly detection is a step away from error-prone manual checking and monitoring of dashboards.

Solutions like Honeycomb’s BubbleUp are building a world where developers and administrators only get paged when something is really wrong, and systems can more easily self-heal.

It’s a bit of a “holy grail” problem, but projects like Kubernetes that have embraced reconciliation loops have pushed the field forward. Extremely accurate detection and remediation is next.

Instant, go-everywhere debugging

What if you didn’t have to just observe the application – what if you could debug the application live, or even run a local instance of your application?

Platforms like Kubernetes now have sophisticated ways of debugging running pods (for example, ephemeral debug containers), and tools like Telepresence are pushing the boundaries and blurring the lines between “remote” and “local”.