Incident Management


What Is MTTR? A Simple Definition That Will Help Your Team

Mean time to resolution (MTTR) is the total amount of time that service was interrupted divided by the number of individual incidents. It is measured in units of time, ideally minutes. That is, unless you blacked out the eastern seaboard for weeks!
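As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name `mttr_minutes` and the sample incident timestamps are hypothetical, invented for this example; the sketch assumes each incident is recorded as a (start, end) pair marking when service was interrupted and when it was restored.

```python
from datetime import datetime, timedelta


def mttr_minutes(incidents):
    """Return mean time to resolution in minutes.

    `incidents` is a list of (start, end) datetime pairs (an assumed
    record format), one pair per individual incident.
    """
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    # Sum of the time that service was interrupted across all incidents.
    total_downtime = sum((end - start for start, end in incidents), timedelta())
    # Divide by the number of incidents and express the result in minutes.
    return total_downtime.total_seconds() / 60 / len(incidents)


# Hypothetical sample data: outages of 30, 45 and 105 minutes.
sample_incidents = [
    (datetime(2019, 7, 1, 9, 0), datetime(2019, 7, 1, 9, 30)),
    (datetime(2019, 7, 8, 14, 0), datetime(2019, 7, 8, 14, 45)),
    (datetime(2019, 7, 15, 22, 0), datetime(2019, 7, 15, 23, 45)),
]

print(mttr_minutes(sample_incidents))
```

With these sample values the total downtime is 180 minutes across three incidents, so the MTTR is 60 minutes.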


Why You Should Standardize Your On-Call Schedules

In the modern era of application delivery and rapid deployment, both developers and IT professionals need to take on-call responsibilities. Monitoring and alerting need to encompass the entire stack. From product development to managing systems in production, you need a holistic incident response and on-call strategy that allows you to quickly identify issues and avoid downtime. IT operations teams and developers alike need to be on call for the services they build and maintain.


2019 Hurricane Season: Solidify a Business Continuity Plan With a Mass Notification Solution

Summer is typically synonymous with beach days, outdoor barbecues and fulfilling weekend getaways. Unfortunately, the summer months aren’t only about enjoyable moments and exciting vacations. It’s also tropical storm season, with higher risks of destruction, community displacement and business operation disruption. With this potential for human and business peril, it’s important for organizations to implement a business continuity plan, equipped with a robust communication strategy.


ChatOps and Using Hubot for Incident Response

ChatOps is a method for using tools to execute commands, surface alert context and take action directly through chat. You can align human workflows with application and infrastructure health, making it easier to communicate and fix problems from a single tool. DevOps and IT teams are using chatbots like Hubot to execute commands directly from chat tools like Slack or Microsoft Teams.


Best Practices for Managing Multiple On-Call Teams

Alerting has come a long way, from the days of paging a single on-call administrator in the middle of the night to multiple on-call teams that run and manage incident response around the clock. As organizations grow and scale, responding to incidents gets more complex, and you often need more than one team involved to successfully resolve an incident.


Keep stakeholders in the know with Incident Timeline from Opsgenie

Technology is changing the world faster than ever. Thanks in part to the rise of the Software-as-a-Service (SaaS) model, customers have come to expect the apps they use to be accessible at all times. As a result, companies are transforming the way their teams operate in order to meet these demands. And perhaps no team experiences the impact of a transformation like this more than IT.


Serverless Event-Driven Workflows with PagerDuty and Amazon EventBridge

This week’s AWS Summit in New York was an exciting one for both AWS and PagerDuty. The AWS team rolled out Amazon EventBridge, a set of APIs for AWS CloudWatch Events that makes it easy for AWS SaaS partners to inject events for their customers to process in AWS. PagerDuty is excited to continue and deepen our long partnership with AWS by supporting EventBridge as a launch partner.


No CMDB? No problem. Not for BigPanda.

I hear it all the time when talking to future BigPanda customers: “I’m not sure BigPanda can really help me correlate all these alerts together because our CMDB is very immature.” Or sometimes, they don’t even have a CMDB, and incorrectly assume this disqualifies them from meaningful noise reduction and alert correlation. I’m happy to tell you the same thing I tell the folks who are looking at BigPanda for the first time: “No CMDB? No problem!”


Proactive Incident Response With Contextual Monitoring and Alerting

In a world of rapid software delivery and CI/CD (continuous integration and delivery), things break. Servers go down, third-party services fail and new code in production can cause unforeseen incidents. So, effective monitoring and alerting are imperative to maintaining highly reliable services. With context appended to alerts, on-call responders can quickly identify the services that are having issues and get those alerts to the right people.