The latest News and Information on Observability for complex systems and related technologies.


Tale of the Beagle (Or It Doesn't Scale - Except When It Does)

If there’s one thing folks working in internet services love saying, it’s: "Yeah, sure, but that won’t scale." It’s an easy complaint to make, but in this post, we’ll walk through building a service using an approach that doesn’t scale in order to learn more about the problem. (And in the process, we’ll discover that it actually scaled for much longer than one would expect.)


Meet Thundra Foresight: Your CI Observability Tool!

Over the past three years, we have served thousands of developers with our two major products, Thundra APM and Thundra Sidekick – and it still feels like we’re just getting started. We would like to thank all of our users and supporters who gave us the strength to build our one-of-a-kind products. And we are very excited to announce our latest innovation: Thundra Foresight!

Observability at Microsoft: Blue Screen of Death to OpenTelemetry

Ted Young discusses OpenTelemetry at Microsoft with Reiley Yang. Reiley is a Principal Software Engineering Manager at Microsoft and a core contributor to OpenTelemetry. Lightstep’s observability platform is the easiest way for developers and SREs to monitor health and respond to changes in cloud-native applications. Powered by cutting-edge distributed tracing and a groundbreaking metrics database, and built by the team that launched observability at Google, Lightstep’s Change Intelligence provides actionable insights to help teams answer the question “What caused that change?”

PD Summit21: Transforming Infrastructure Teams Through Observability

What is this "observability" thing that everyone is talking about? Observability allows you to navigate the dark unknowns with echolocation while others attempt to fly blindly without it. Are your dashboards all green, but you still have an issue brewing? Do you need instant feedback based on the Core Analysis loop? Are your engineers tired of waking up at 3 AM for the expected issues? Is there a lack of time for experimentation? Generate your own answers and create a meaningful course of action with observability.

Establishing a Culture of Observability at Vanguard

Rich Anakor, chief solutions architect at Vanguard, is on a small team with a big goal: Give Vanguard customers a better experience by enabling internal engineering teams to better understand their massively complex production environment—and to do that quickly across the entire organization, in the notoriously slow-moving financial services industry. They also had a big problem: The production environment itself.

Detect any issue with Splunk APM before it turns into a customer problem

With 100% of spans and traces captured, Splunk APM helps you meet business KPIs and SLO metrics while you investigate and troubleshoot transaction errors related to a backend application. Easily construct error budgets that measure the performance of services today - learn how with this free trial of Splunk Observability Cloud.

PD Summit21: MUX: Video Observability: Operational Alerting for Responding to Issues In Real-time

Streaming video accounts for the majority of internet traffic, and your applications and infrastructure almost certainly include video. Mux Data allows you to easily monitor the real-time quality of experience delivered to your video viewers, and by integrating with PagerDuty you can automate a response and reduce the time to resolution when something goes wrong. We will cover the basics of video monitoring and how integrating with PagerDuty can ensure a great experience for viewers.

What's the Difference between Observability and Monitoring?

Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they are important, and some suggested tools that can help. The difference between observability and monitoring is that observability is the ability to understand a system’s state from its outputs, often referred to as understanding the “unknown unknowns”.

SRE's Guide to Chaos & Observability

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure. Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.

Observability with Zero Code Instrumentation? Meet eBPF

Current observability practice is largely based on manual instrumentation, which requires adding code at relevant points in the user’s business logic to generate telemetry data. This can become quite burdensome and create a barrier to entry for many wishing to implement observability in their environment. This is especially true in Kubernetes environments and microservices architectures.