Monthly Archive

Instrument Jenkins With OpenTelemetry

Nov 27, 2025 By Anjali Udasi In Last9

You can instrument Jenkins with OpenTelemetry using the official plugin and an OpenTelemetry Collector, then send the data to a backend like Last9 to understand where pipeline latency and failures actually originate. Jenkins provides job status and console logs, but it doesn't show how time is distributed across stages, agents, plugins, and external systems. OpenTelemetry fills that gap by emitting traces, metrics, and logs in a standard format that any OTLP-compatible backend can process.

Read Post

Last9

Pastries with SREs: Holding onto extra observability data and desserts

Nov 26, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, we dig into why you should keep all of your observability data, even if you don’t need it quite yet. With enriched logs and flexible, cost-effective storage, you can stop worrying about what you might need later and start answering questions with confidence, no matter when they arise.

View Video

Elastic

How AI Agents Are Redefining the SRE Role

Nov 25, 2025 By PagerDuty In PagerDuty

Even the best site reliability engineers (SREs) spend too much time doing reactive work—triaging incidents, gathering context, escalating to the right teams, and documenting what happened. That work is essential, but it’s not where an SRE’s highest value lies. These engineers are hired to build and maintain resilient systems, not play air-traffic control with every alert that hits their queue.

Read Post

PagerDuty

Lessons from KubeCon: What "Best-of-Breed" AI SRE Really Requires

Nov 24, 2025 By Ilan Adler In Komodor

This year’s KubeCon underscored a real shift: AI SRE has gone mainstream. Of course, it’s not a surprise. Teams from high-growth startups to Fortune 500s are running more complex, cloud-native systems, shipping more AI-generated code, and facing rising expectations. Downtime is absolutely not an option and the work for on-call SREs has become unsustainable. The question isn’t whether AI SRE helps. It’s which one you can trust in production.

Read Post

Komodor

7 Observability Solutions for Full-Fidelity Telemetry

Nov 24, 2025 By Anjali Udasi In Last9

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.

Read Post

Last9

Mezmo + Catchpoint deliver observability SREs can rely on

Nov 24, 2025 By Mezmo In Mezmo

For SREs juggling multiple services, third-party dependencies, and constant alerts, a critical service slowdown can quickly turn into chaos. APM dashboards may show everything is fine, yet users are still experiencing problems. That gap between application telemetry and real-world performance can turn a five-minute fix into a two-hour war room.

Read Post

Mezmo

Introducing Bits AI SRE, your AI on-call teammate

Nov 24, 2025 By Datadog In Datadog

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

View Video

Datadog

Top 7 Observability Platforms That Auto-Discover Services

Nov 21, 2025 By Anjali Udasi In Last9

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, without requiring any additional steps on your part. In this blog, we talk about the top seven platforms that offer these capabilities.

Read Post

Last9

How to Reduce Log Data Costs Without Losing Important Signals

Nov 20, 2025 By Anjali Udasi In Last9

You can cut your log costs by removing repetitive, low-value logs early and keeping only the parts that genuinely help you understand issues. Modern systems generate logs far faster than you expect. Even when your workload stays stable, infrastructure components, retries, and background workers continue producing a steady stream of repeated entries.

Read Post

Last9

OTel Updates: Complex Attributes Now Supported Across All Signals

Nov 19, 2025 By Anjali Udasi In Last9

OpenTelemetry now supports maps, heterogeneous arrays, and byte arrays across all signals. Here’s where these new types shine — and where simple primitives still fit naturally. If you’ve been working with OpenTelemetry for a while, you’re likely familiar with the straightforward key-value approach to attributes. It’s simple, fast, and works well with how most telemetry backends store, index, and query data.

Read Post

Last9

What is AWS Fargate for Amazon ECS?

Nov 19, 2025 By Anjali Udasi In Last9

As cloud applications moved from VMs to containers and then to microservices, the amount of background work needed to keep everything running grew just as quickly. You gain speed and flexibility, but you also end up managing clusters, scaling rules, and capacity choices that don’t really add to the product you’re building. AWS Fargate steps in right there. It lets you run your ECS tasks without looking after any servers at all.

Read Post

Last9

Pastries with SREs: FinOps is to ROI as a coffee is to cannoli

Nov 19, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, our hosts tackle one of the hardest questions observability leaders face: "How do you prove the ROI of observability?" This isn’t just about uptime or dashboards. It’s also about aligning observability with business outcomes, cloud cost savings, and FinOps metrics that matter to leadership.

View Video

Elastic

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

Nov 19, 2025 By Rootly In Rootly

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

View Video

Rootly

GPT-5.1 is here: does it spend less tokens? #ai #sre

Nov 18, 2025 By Rootly In Rootly

View Video

Rootly

Mezmo's AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA)

Nov 17, 2025 By Mezmo In Mezmo

We are thrilled to announce the availability of Mezmo’s AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA), a transformative leap forward for engineering and operations teams, included in your existing subscription at no additional charge. We are paving the way for a new era of observability, moving beyond passive, reactive monitoring to a world of proactive AI-driven observability.

Read Post

Mezmo

Top 9 Web Application Performance Monitoring Tools for 2025

Nov 17, 2025 By Anjali Udasi In Last9

You know that uneasy pause before opening your monitoring dashboard? The one where you're hoping nothing's broken—but a part of you knows something probably is. Performance issues often start quietly: a few slow endpoints, a checkout that takes longer than usual, a graph that looks a little off. Before long, those small signals turn into alerts and support tickets.

Read Post

Last9

Build Your Kubernetes Monitoring Foundation with kube-prometheus-stack

Nov 13, 2025 By Anjali Udasi In Last9

When you run Kubernetes at scale, one of the first challenges is understanding what the cluster is actually doing. Workloads shift around, pods restart for normal reasons, and traffic doesn't always follow the patterns you expect. Having clear signals makes day-to-day operations much easier. That's where kube-prometheus-stack helps. It brings Prometheus, Grafana, Alertmanager, and supporting components together as a single package.

Read Post

Last9

OTel Updates: OpenTelemetry eBPF Instrumentation (OBI) Hits Alpha

Nov 12, 2025 By Anjali Udasi In Last9

Some parts of a system don’t lend themselves to quick instrumentation changes. You might have a production binary that hasn’t been rebuilt in years, or a stack made of several languages where each team manages telemetry differently. In those situations, getting consistent signals often means touching code you’d rather leave alone or coordinating updates across many services. OpenTelemetry eBPF Instrumentation (OBI) approaches this from the kernel side.

Read Post

Last9

Pastries with SREs: No compromises on cost-effective observability or donuts.

Nov 12, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, we dig into how vendor lock-in and sky-high observability costs are forcing teams to choose between coverage and budget, and why you shouldn’t have to settle. With donuts in hand, we explore how to take back control of your observability strategy by making it cost-effective, comprehensive, and flexible.

View Video

Elastic

OpenTelemetry Metrics in Quarkus Explained

Nov 10, 2025 By Anjali Udasi In Last9

When you run services on Quarkus, you need a steady stream of signals to understand how the application behaves: CPU trends, request timings, memory patterns, and how each endpoint responds under load. Metrics give you that visibility. OpenTelemetry fits well here because it gives Quarkus a common way to generate and export metrics without locking you into a specific monitoring tool.

Read Post

Last9

How to Choose an AI SRE Solution

Nov 10, 2025 By Ariel Russo In PagerDuty

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

Read Post

PagerDuty

How Rootly works with Slack | An end-to-end demo.

Nov 9, 2025 By Rootly In Rootly

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.

View Video

Rootly

How Prometheus Exporters Work With OpenTelemetry

Nov 6, 2025 By Anjali Udasi In Last9

Running distributed systems means you need clear visibility into how your services behave. Prometheus has been the standard for metrics for a long time, and OpenTelemetry is now giving teams a more consistent way to collect telemetry across their stack. In many setups, you'll have both: existing Prometheus instrumentation that's already in place, and new components instrumented with OpenTelemetry.

Read Post

Last9

Bits AI SRE, Flex Frozen, and GPU Monitoring | DASH 2025

Nov 6, 2025 By Datadog In Datadog

Get a first look at Datadog’s biggest product reveals from DASH 2025. Meet Bits AI SRE, your 24/7 autonomous AI Site Reliability Engineer, Flex Frozen for up to 7 years of managed log retention, and GPU Monitoring for full visibility into your AI workloads. Experience the future of observability in action.

View Video

Datadog

What Are AI Guardrails

Nov 5, 2025 By Anjali Udasi In Last9

When you're shipping LLM features, a lot of the work goes into keeping the model's behavior predictable, an everyday concern when you integrate LLMs into production systems. Guardrails AI provides a Python framework that helps you enforce those expectations. You define the schema or constraints you need, and the framework validates both the inputs going into the model and the outputs coming back.

Read Post

Last9

Pastries with SREs: From AIOps to GenAI and LLMs (lactose-free latte making)

Nov 5, 2025 By Elastic In Elastic

In this episode of Pastries with SREs, we look at AIOps, where it fell short, where it worked, and how generative AI (GenAI) is reshaping what’s possible in observability today. If you’re wondering whether generative AI is different this time, this episode offers a grounded, practical look at how it’s evolving observability workflows.

View Video

Elastic

You Can't Fix What You Don't Measure: Observability in the Age of AI with Conor Bronsdon

Nov 5, 2025 By Rootly In Rootly

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to make large language models (LLMs) reliable in production.

View Video

Rootly

Grafana Tempo: Setup, Configuration, and Best Practices

Nov 4, 2025 By Anjali Udasi In Last9

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.

Read Post

Last9

SRE vs DevOps vs Platform Engineering: What Are the Key Differences

Nov 4, 2025 By Randhir Kumar In Spike

Software delivery is more complex than ever. Teams need speed, reliability, and scalability to stay competitive. Site Reliability Engineering (SRE), DevOps, and Platform Engineering are three key disciplines that address these challenges. Though these terms are often used together, they are not the same and differ in important ways. In this blog, we’ll discuss each term individually, compare SRE vs. DevOps vs. Platform Engineering, and also show how they work together.

Read Post

Spike

OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Nov 3, 2025 By Anjali Udasi In Last9

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.

Read Post

Last9

Embracing failure and chaos to improve system reliability and SRE team performance

Nov 3, 2025 By Elastic In Elastic

In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional metrics like MTTR and MTTx can give a false sense of reliability. Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.

View Video

Elastic
