Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.


Announcing Status Pages

Communication is one of the hardest things to do well while responding to incidents. At FireHydrant, we’ve focused on helping people communicate well within their teams when responding to incidents, and also after the fact during post-incident reviews. But what about communicating with your customers? During an incident, your customers want to know that you’re aware of the problem and are working to mitigate or resolve it.


SRE Leaders Panel: Managing Systems Complexity

In our previous panel, we spoke about how to overcome imposter syndrome in high-tempo situations, and how culture directly affects the availability of our systems. Building on that discussion, we gathered leading minds in the resilience industry to discuss how SRE can manage systems complexity, and how that is tightly intertwined with business health, especially in the context of current health and social crises.


SLO Adoption at Twitter

This is the second article of a two-part series. Click here for part 1 of the interview with Brian, Carrie, JP, and Zac to learn more about Twitter’s SRE journey. Previously, we saw how SRE at Twitter has transformed their engineering practice to drive production readiness at scale. The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability.
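
As a rough illustration of how an SLO target translates into an error budget, here is a minimal Python sketch. The function names and the request counts are hypothetical and chosen only for illustration; they do not describe Twitter's tooling or numbers.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests that may fail in the window while still meeting the SLO.

    slo_target is a fraction such as 0.999 (hypothetical value, not from the article).
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    # round() absorbs floating-point noise in (1 - slo_target).
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Failures still allowed before the SLO is breached (negative means breached)."""
    return error_budget(slo_target, total_requests) - failed_requests


if __name__ == "__main__":
    # Hypothetical window: one million requests against a 99.9% availability SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A negative remaining budget is the kind of signal a team might use to prioritize reliability work over new features.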


Meet the PagerDuty Product Design Team

How do you design a product that customers love to hate? “Hey, that thing you’re responsible for is down. Oh, and people have noticed and they’re complaining about you on Twitter. OK BYEEE!” Our customers love PagerDuty (they legit tell us this). At the same time, they hate hearing from us because it means trouble.


Twitter's Reliability Journey

Twitter’s SRE team is one of the most advanced in the industry, managing the services that capture the pulse of the world every single day and throughout the moments that connect us all. We had the privilege of interviewing Brian Brophy, Sr. Staff SRE; Carrie Fernandez, Head of Site Reliability Engineering; JP Doherty, Engineering Manager; and Zac Kiehl, Sr. Staff SRE, to learn how SRE is practiced at Twitter.


What's New: PagerDuty Fits the Way You Work, Where You Are, and With the Tools You Love!

This month, we are excited to announce a new set of product updates and enhancements built for real time and designed to fit the way you work, where you are, and with the tools you love to use! We continue to focus on enhancing the core functionality and aesthetic of the PagerDuty mobile app to help users better manage their digital operations while on the go. We strive to go beyond harnessing digital data from any software-enabled system to transform any signal into real-time insight and action.


How SLIs Help You Understand Users' Needs

In our article on SLOs, we discussed the need for service level indicators to be relevant to the users’ experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.
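
One way to picture consolidating several internal metrics into a single journey-level indicator is sketched below in Python. The journey steps (login, search, checkout), the field names, and the 500 ms latency threshold are assumptions made up for this example; they are not taken from the article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class JourneySample:
    """One observed user journey (hypothetical steps, for illustration only)."""
    login_ok: bool
    search_latency_ms: float
    checkout_ok: bool


def journey_sli(samples: Sequence[JourneySample],
                max_search_latency_ms: float = 500.0) -> Optional[float]:
    """Fraction of journeys in which every step was good from the user's view.

    Returns None when there are no samples, since the ratio is undefined.
    """
    if not samples:
        return None
    good = sum(
        1
        for s in samples
        if s.login_ok
        and s.search_latency_ms <= max_search_latency_ms
        and s.checkout_ok
    )
    return good / len(samples)


if __name__ == "__main__":
    observed = [
        JourneySample(True, 120.0, True),
        JourneySample(True, 900.0, True),   # search too slow for the user
        JourneySample(True, 200.0, True),
        JourneySample(True, 310.0, True),
    ]
    print(journey_sli(observed))
```

A journey counts as good only when all of its steps are good, so the single number reflects the whole experience instead of any one internal metric.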


HIPAA-Compliant Text Messaging: Best Practices and Policies

HIPAA-compliant text messaging enables healthcare providers to communicate securely with patients and other healthcare providers. To ensure HIPAA compliance, you need to apply HIPAA standards to secure electronic data transmissions (in this case, text messages). The goal is to protect transmissions that contain protected health information (PHI).