Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Review & Postmortem Reports: 8 Best Practices

People make mistakes, technology breaks down, and processes aren’t infallible. But when incidents happen, what can we do about them? What can we learn? As with all things, learning isn’t a binary action; it’s a process. When an incident occurs, organizations typically conduct a post-mortem analysis and generate a post-incident review to uncover what went wrong and why.

Sponsored Post

How to Spot the Effects of Alert Fatigue

Imagine being part of an overactive group chat that causes your phone to buzz every few minutes. In the beginning, you open every message, but you soon realize that most of them aren't important, or at least aren't relevant to you. So, what do you do next? Maybe you let the messages pile up and check them later. Or perhaps you mute the group chat and ignore the incoming messages altogether. You can blame this tendency to ignore or avoid incoming messages or notifications on one culprit: alert fatigue.

How Retrospective Data Enhances Reliability Insights

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. In fact, the ability to tag incidents makes reliability data even more helpful by allowing you to collect granular details about reliability, especially as they pertain to your unique business needs.

FireHydrant Tasks provide turn-by-turn navigation during an incident

An incident has been declared and your runbook has fired. Everyone is gathered in your Slack channel, the tickets are opened, and roles are assigned. Now what? This is when most teams manually update status pages and kick off investigation streams using a patchwork of tribal knowledge and supporting playbook documents.

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

Episode 5: Mooving to... Practical Postmortems

Episode 5, Mooving to… Practical Postmortems, covers how to leverage postmortems to effectively learn from failure. Postmortems are a commonplace reference and are now considered a best practice in most modern engineering teams. However, there’s still a lot of confusion about what postmortems should be and, more importantly, what they should NOT be. Thom Duran, Senior Manager of Productivity from Panther, walks us through all that and more in the latest Mooving To… episode!

What should you choose? Docker Swarm vs Kubernetes

Since Linux introduced containerisation many years ago, the industry has shifted from traditional virtual machines to containers. Containers have made application development much easier than it used to be. Docker Swarm and Kubernetes came into play as the number of containers within a system grew, and both help orchestrate those containers. The question that arises is: which one is the better option?

Top Incident Response Metrics & How to Use Them

A software organization should always strive to improve in two categories: the efficiency of incident management and overall application quality. Data analysis is one way your organization can improve both. However, the questions remain: which metrics should be collected, and how can analysis of these metrics facilitate these improvements? Read on to hear about five key metrics essential to incident response.

Our fully-redesigned incident response experience delivers a more intuitive workflow

Today we’re releasing fully redesigned Slack and Command Center experiences for FireHydrant so anyone on your team can intuitively navigate the incident response process — in the app or on the web. There are many things you can do ahead of an incident to help things run smoothly: design and document your process, automate predictable steps, train the team, and run drills.