SRE

The latest news and information on Site Reliability Engineering and related technologies.

The unresolved cost of High Cardinality

Dec 15, 2023 By Prathamesh Sonpatki In Last9

Fulfill all your food delivery orders this December 31st by taming High Cardinality data 😉

Lessons in Incident Response I Learned While Waiting Tables

Dec 13, 2023 By Ashley Sawatsky In Rootly

Before I stumbled into the tech industry (a story for another day), I spent several years in the customer service world as a server and front-of-house manager in restaurants. It was in these jobs that I first honed some critical skills that would later lead me on the path to incident response.

Prometheus Metrics Types - A Deep Dive

Dec 13, 2023 By Tripad Mishra In Last9

A deep dive into the different metric types in Prometheus and best practices for using them.

Incident vs Bug: Understanding the Key Differences

Dec 12, 2023 By Anjali Udasi In Zenduty

Incidents and bugs are two common occurrences that can disrupt the smooth operation of systems and applications. While these terms may seem similar, they represent distinct concepts with different implications. Understanding the nuances between incidents and bugs is crucial for effective incident management and proactive problem resolution.

Comparing Uptime Monitoring, Heartbeat Monitoring, and Synthetic Monitoring

Dec 8, 2023 By Chitra Bisht In Squadcast

In the quest for a high-velocity development environment, one fundamental question looms large: "How can you ensure an exceptional end-user experience when an array of engineers continually push and deploy code?" The unequivocal answer to this pivotal inquiry lies in the establishment of robust, straightforward, and well-defined monitoring practices.

Monitor Cloudflare Workers using Prometheus Exporter

Dec 8, 2023 By Aniket Rao In Last9

A complete guide to monitoring Cloudflare Workers using Prometheus Exporter.

IT Automation Powers SRE Practices as System Complexity, Consumer Demands Grow

Dec 8, 2023 By John Gorham In Resolve

Site Reliability Engineers (SREs) use automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient. Site Reliability Engineering (SRE) can be applied to a wide range of use cases and industries, where software systems and services are critical to business operations.

Autocorrelate Alerts With Squadcast's Key-Based Deduplication

Dec 7, 2023 By Chitra Bisht In Squadcast

With the increasing complexity of technology stacks and monitoring tools, managing incidents can become overwhelming, leading to alert noise, alert fatigue, and delayed responses. This is where Key-Based Deduplication comes to the rescue, streamlining incident handling and enhancing the effectiveness of your Incident Management platform.

Why you need a Time Series Data Warehouse

Dec 7, 2023 By Rishi Agrawal In Last9

What is a Time Series Data Warehouse? How does it help in your monitoring journey? How does it differ from a Time Series Database? That and more.

When More Incident Commanders are Better

Dec 6, 2023 By Strong Liang In Rootly

This article has been lightly revised and reposted, with the author's permission, from the original on Medium. Leading major incident responses can be extremely stressful. You have to quickly gather an ad-hoc team, figure out what went wrong, identify a fix, and make sure it doesn't make things worse, all with senior leadership breathing down your neck. Are we having fun yet? Many people think having a dedicated incident commander role will solve the problem.
