Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Alerting, Incident Management and the SDLC | Better Incidents Podcast Ep. 8

Oct 5, 2023 By FireHydrant In FireHydrant

In this episode, we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.


Global Event Rulesets: Streamlining Alert Routing Across Services

Oct 4, 2023 By Vishal Padghan In Squadcast

For organizations that run numerous microservices and projects, managing alerts across all of them can be a daunting task. Because many of our customers come to us with infrastructures that include a large number of microservices, we set out to make it easier for them to streamline alert source management. Enter Global Event Rulesets (GER). This feature is designed to redefine the way you manage alerts.


The Link Between Early Detection and Internet Resilience: A Lesson from Salesforce's Outage

Oct 4, 2023 By Madan Gopal N In Catchpoint

Almost every study examining the hourly cost of outages reaches the same conclusion: outages are expensive. According to a 2016 study, the average cost of downtime was estimated at approximately $9,000 per minute. In a more recent study, 61% of respondents stated that outages cost them at least $100,000, with 32% indicating costs of at least $500,000 and 21% reporting expenses of at least $1 million per hour of downtime.


Practicing SDLC the right way #shorts #incidentresponse #sre #softwareengineer

Oct 4, 2023 By FireHydrant In FireHydrant


The problem with noise in Alerting #shorts #incidentresponse #sre #softwareengineer

Oct 4, 2023 By FireHydrant In FireHydrant


Whose fault was it anyway? On blameless post-mortems

Oct 4, 2023 By incident.io In Incident.io

No one wants to be on the receiving end of the blame game—especially in the wake of a major incident. Sure, you know you were the one who made the final change that caused the incident. And hopefully, it was a small one that didn’t cause any SEV-1s. Still, the weight of knowing you caused something bad should be enough, right? Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly, everyone knows that you’re the person who created more work for everyone.


Choosing the Right Metrics for Noiseless K8s Alerting

Oct 4, 2023 By Zenduty In Zenduty

Watch Ankur Rawal and Dheeraj Reddy talk about how to choose the right metrics for noiseless K8s alerting. They share insights and suggestions based on the mistakes made by hundreds of companies while implementing Prometheus Alertmanager in their production systems, and explain how much bad monitoring could be costing you. This talk was delivered at PromCon 2023 in Berlin.


Blameless Introduces The First Generative AI-powered, Automated Incident Communications With Comms Assistant

Oct 3, 2023 By Blameless In Blameless

Revolutionizing Incident Communications, Blameless Introduces Generative AI To More Fully Automate Incident Communication Workflows.


What Is the Role of an Incident Commander?

Oct 3, 2023 By Eduardo Messuti In Statuspal

For most businesses, managing major incidents can be intimidating. With a swarm of information coming from different directions, keeping things organized and maintaining clear, effective communication is tough. It only gets worse when there's no defined process to follow. This disorganization confuses everyone, delays responses, and increases the incident escalation rate. Enter the incident commander (IC).


Incident response and awareness acceleration: What we can learn from responders to the Queenstown floods

Oct 3, 2023 By Kaushik Thirthappa In Spike

I was visiting Queenstown, New Zealand, last week during the severe floods, which escalated quickly. As an incident responder myself, I was amazed at the operations and at how fast responders on the ground acted in evacuating people and clearing the area. Over 100 people were evacuated in the middle of the night with zero casualties. A commendable job. Here are some observations I made and what we can learn as incident responders ourselves.
