TL;DR: Observability (“o11y”) has undergone a nearly four-decade transition: from Apache/NGINX logs, to Nagios, to Zipkin/Jaeger/Prometheus, and finally to self-serve platforms powered by Kubernetes.
Setting aside the punch-card era and a few intermediate technologies, the invention of the teletype gave developers their most comfortable debugging technique – logging.
Logs are arguably the first pillar of observability – for a long time (and even now) the easiest/best way to check the health of applications was a quick look or programmatic scan of Apache, NGINX, or application logs.
Application-level logs (and tools like logrotate) helped people investigate issues and debug just as they do today.
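The “quick programmatic scan” mentioned above is easy to sketch. The snippet below is an illustrative example (not tied to any particular tool) that tallies HTTP status codes from access-log lines in the Common Log Format that Apache and NGINX emit by default:

```python
import re
from collections import Counter

# Simplified Common Log Format, as written by Apache/NGINX access logs.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def status_counts(lines):
    """Tally HTTP status codes from access-log lines, skipping unparseable ones."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group("status")] += 1
    return counts

sample = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612',
    '127.0.0.1 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153',
]
print(status_counts(sample))  # Counter with one 200 and one 404
```

A spike in the 5xx bucket of a tally like this was – and often still is – the first sign that something is wrong.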
Logs are great, but sometimes being able to track metrics is better.
Underlying infrastructure like routers, switches, and hard drives are easier to monitor with access to discrete measurements that represent usage, contention, and other constraints.
After working to keep Novell NetWare servers up, Ethan Galstad wrote NetSaint, later renamed Nagios. The history of Nagios is fascinating, but more importantly it captures a shift in the evolution of observability – people started to rely widely on pull-based metrics, and tools were built with pluggability so that many different reusable checks could be performed.
Nagios network overview page, circa 2006
While Nagios seems quaint now, it represented a new era for observability, and was a vital tool for many system administrators.
Maintaining large metrics ingestion, analysis, and graphing systems is hard.
Like any good Software as a Service (“SaaS”) project, New Relic made it easy for companies to focus on their core competencies and reduce the maintenance burden for monitoring.
New Relic made it easy for developers to focus on the “golden signals” that correlated most to app health and user experience.
NewRelic.com in 2008
These days we commonly call them “golden signals” – latency, traffic, errors, and saturation – but New Relic was innovative in surfacing only the most important metrics, pushing users to conserve attention for the groupings of metrics that really matter.
One such grouping is the RED method – Rate (requests per second), Errors (failed requests), and Duration (the latency distribution). Another is the USE method – Utilization, Saturation, and Errors. In general, RED is a little better suited to applications, while USE is better suited to lower-level components and underlying infrastructure. Note that regardless of what you’re running, errors are quite important to track.
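To make the RED method concrete, here is a minimal sketch (illustrative names, not any vendor’s API) that computes the three signals from a window of completed requests:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def red_metrics(requests, window_seconds):
    """Compute the RED signals over a window of completed requests:
    Rate (requests/second), Errors (count of 5xx), Duration (p50 latency, ms)."""
    rate = len(requests) / window_seconds
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    p50 = durations[len(durations) // 2] if durations else 0.0
    return {"rate": rate, "errors": errors, "duration_p50_ms": p50}

window = [Request(12.0, 200), Request(48.0, 200), Request(250.0, 500)]
print(red_metrics(window, window_seconds=60))
# {'rate': 0.05, 'errors': 1, 'duration_p50_ms': 48.0}
```

Real systems track the full latency distribution (p95, p99) rather than a single percentile, but the shape of the computation is the same.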
Logs and metrics are great, but for tracking down defects, knowing the precise flow of an operation or request is a game changer.
Google was amongst the first to openly discuss their approach to tracing, publishing a paper on an internal system called Dapper in 2010. Zipkin was developed by Twitter in 2012, which influenced the development of Jaeger at Uber in 2015.
Adopting tracing in a codebase introduced people to the fundamental tension of observability and instrumentation. Much of the time, good observability into a system isn’t free – systems must be built to be observable.
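The core idea every tracer shares – spans with IDs that link child work back to its parent – can be sketched in a few lines. This is a toy illustration of the concept, not the Zipkin or Jaeger API:

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []   # a real tracer would export these to a backend
_span_stack = []      # tracks the current parent for nested spans

@contextmanager
def span(name):
    """Record one unit of work, linking it to its parent span if one is active."""
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": _span_stack[-1]["span_id"] if _span_stack else None,
        "start": time.time(),
    }
    _span_stack.append(record)
    try:
        yield record
    finally:
        _span_stack.pop()
        record["end"] = time.time()
        finished_spans.append(record)

with span("handle_request"):
    with span("query_database"):
        time.sleep(0.01)  # simulated work

inner, outer = finished_spans  # the inner span finishes (and is recorded) first
print(inner["parent_id"] == outer["span_id"])  # True
```

The tension described above shows up even here: every function that should appear in a trace has to be wrapped in a span, which is exactly the instrumentation work that systems must be built to accommodate.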
At this point Nagios had lived long enough to become the villain, and Prometheus burst onto the scene in 2012 as a simpler, scalable and more efficient alternative to Nagios.
Prometheus reintroduced the world to pull-based metrics, innovated the Prometheus Exposition Format (which is now part of the OpenMetrics standard) and became a widely adopted piece of the cloud native landscape.
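The exposition format itself is plain text, which is a big part of why it spread so widely. Below is a hedged sketch of rendering a counter in that format by hand (in practice you would use an official Prometheus client library rather than string formatting):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.
    `samples` is a list of (labels-dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

text = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    [({"method": "get", "code": "200"}, 1027),
     ({"method": "get", "code": "404"}, 3)],
)
print(text)
# # HELP http_requests_total Total HTTP requests served.
# # TYPE http_requests_total counter
# http_requests_total{method="get",code="200"} 1027
# http_requests_total{method="get",code="404"} 3
```

A Prometheus server scrapes text like this from each target’s `/metrics` endpoint – the pull-based model the section above describes.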
Logs, metrics and traces are useless without an ergonomic looking glass (no, not that Looking Glass).
Grafana appeared in 2014 to save companies from building ad hoc visualizations or generating gnuplots and SVGs by hand. Many companies had built bespoke dashboards until Grafana came onto the scene and offered an open source solution for the most common visualizations and charts. Observability data finally became much easier to observe.
With all the good tools out there, why build another one?
Datadog is one of the leaders in the observability space because it managed to centralize observability data (metrics, logs, and traces) and made it nearly effortless to collect, analyze and display information.
Logs, metrics, traces, user session data and more can all be fed into Datadog and used easily with client libraries and responsive web-based analysis and alerting.
What would have taken an experienced team years to get in place can be set up in seconds with these SaaS solutions.
Entrusting your observability needs to vendors brings ease and efficiency, but many high-performing organizations need observability at home in their own data centers – often as a self-service environment where developers are empowered to act autonomously.
As containers did their part in eating the world, Kubernetes became the leader in container orchestration software.
As the saying goes, the future is already here – it’s just unevenly distributed.
While many companies entrust their data to Honeycomb or Datadog, a growing sector of companies choose to run their own observability platforms internally.
Running Prometheus, Zipkin/Jaeger, Grafana, Elasticsearch, and other observability tools per-team and across internal organizations is a reality for some teams today but not all, and the trend is growing.
That said, managing Kubernetes or any other robust container orchestration platform is not for the faint of heart. Considerable expertise is required to maintain robust and resilient platforms.
Despite all the value that tools like Sysdig provide, integration with on-premise systems can be difficult and requires careful planning.
For service providers, companies like Replicated have popped up to make it easy for businesses to deploy production-grade observability tooling “at home” in their data centers.
The Custom Resource Definition (“CRD”) enables Kubernetes to be extended almost infinitely – covering new use cases and integrations, and reducing toil for cluster operators.
The operator pattern is simple, yet extremely effective: watch the desired state declared in a resource, compare it with the actual state of the world, and take whatever actions are needed to make the two converge.
This is the key to the self-healing ability of Kubernetes and other advanced systems – and the key to massive scale without proportional toil for platform engineers providing observability.
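A single pass of that reconciliation loop can be sketched as follows. This is a toy model of the idea – the names are illustrative, not real Kubernetes API objects:

```python
def reconcile(desired, actual):
    """One pass of a Kubernetes-style reconciliation loop: diff the desired
    state (e.g. from a CRD spec) against the actual state and return the
    actions needed to converge them. An operator runs this continuously."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

desired = {"prometheus": {"replicas": 2}, "grafana": {"replicas": 1}}
actual = {"prometheus": {"replicas": 1}}
print(reconcile(desired, actual))
# [('update', 'prometheus', {'replicas': 2}), ('create', 'grafana', {'replicas': 1})]
```

Because the loop always compares against current reality rather than replaying a history of commands, a crashed component or a manually deleted resource is simply recreated on the next pass – which is the self-healing behavior described above.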
The story of observability is far from over, so here at OCV we’re asking ourselves the age-old question – “What’s next?”
Projects like Pixie are pushing the boundaries on what adding observability can look like. One of the biggest benefits of tools like Pixie is the lack of setup required.
Since OpenCensus merged with OpenTracing to produce OpenTelemetry, the vast array of SDKs and off-the-shelf components offered by OpenTelemetry has made it drastically easier to add observability to any project.
Will a day come where observability for most languages is completely automatic?
eBPF has been a transformative addition to the Linux kernel, enabling observability of low level kernel constructs and operations.
With eBPF, administrators can collect metrics from system calls, network devices, and drivers at speeds that were previously hard or impossible to achieve. In the Kubernetes space, Cilium is a networking stack built on the power of eBPF.
Correlation separates data from noise. As we build and use more and more tools to be better informed about our systems, correlating what we find and measure becomes of utmost importance.
Lots of teams are working on this:
Observability is a flywheel that is still spinning – higher level analysis across tooling, services and platforms is still yet to be built.
Every step forward in anomaly detection is a step away from error-prone manual checking and monitoring of dashboards.
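Even the simplest statistical check illustrates the difference between watching a dashboard and being paged. The sketch below flags a metric value that drifts too far from its recent history – a crude stand-in for the far more sophisticated methods real anomaly-detection systems use:

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations
    from the mean of recent history (a simple z-score check)."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

latency_ms = [102, 98, 101, 99, 100, 103, 97]  # recent p50 samples
print(is_anomalous(latency_ms, 100))  # False: within normal variation
print(is_anomalous(latency_ms, 250))  # True: worth paging someone
```

Production systems must also handle seasonality, trends, and noisy baselines, which is exactly where purpose-built tooling earns its keep.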
Solutions like Honeycomb’s BubbleUp and others are building a world where developers and administrators only get paged when something is really wrong, and systems can more easily self-heal.
It’s a bit of a “holy grail” problem, but projects like Kubernetes that have embraced reconciliation loops have pushed the field forward. Extremely accurate detection and remediation is next.
What if you didn’t have to just observe the application – what if you could debug the application live, or even run a local instance of your application?