Operations | Monitoring | ITSM | DevOps | Cloud

Mastering the Sev0

Remind yourself of the worst incident your organization has faced. If you’re lucky it might have been your entire service being offline for a period of time. Less lucky, and perhaps you encountered something affecting the sensitive data your organization is the custodian of. Whilst uncommon, incidents of this severity happen to every organization at some point. This criticality of situation is what many refer to as a Sev0, the most severe of incidents.

Don't take a cookie cutter approach to incident management with Toby Jackson

This week, we have a really fun conversation lined up. For this episode, we chatted with Toby Jackson, Global SRE Team Lead at Future, about why it’s a bad idea to take a cookie-cutter approach to incident management or, put another way, why it’s not a good idea to treat all incidents alike. In our conversation, we discuss what’s wrong with this approach, some situations where this might actually make sense, how psychological safety factors into this conversation, and a whole lot more.

Exoskeletons not robots

In this clip, Pete explains why we've taken the approach of "exoskeletons, not robots" when building with AI. It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles?

Building AI features? Don't forget your product principles

It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles? The reality is that deciding to build AI into your product isn’t a decision you make on a whim.

The importance of psychological safety in incident management

When an incident strikes, it often brings a whirlwind of stress for everyone involved—from the teams directly handling the issue to the stakeholders making crucial decisions. Imagine support teams on high alert, customers anxiously awaiting resolutions, and executives probing for answers to steer the company through turbulent times. This mounting pressure can make a challenging situation nearly unmanageable, especially when faced with problems that are new or unexpected.

Clinical troubleshooting with Dan Slimmon

It’s no secret that teamwork is one of those things that, when done right, can make a world of a difference. So sometimes, when responding to a particularly complicated incident, it can be best to bring a team together to figure out what’s going on and work towards a fix. But it’s not enough to just jam a bunch of folks into a room and hope for the best. You need a framework in place to ensure that everyone stays focused, diagnoses the issue and resolves it as quickly as possible.

What is clinical troubleshooting? #incidentmanagement #incidentresponse #sitereliabilityengineering

In this clip, Dan Slimmons explains what this clinical troubleshooting framework entails. It’s no secret that teamwork is one of those things that, when done right, can make a world of a difference. So sometimes, when responding to a particularly complicated incident, it can be best to bring a team together to figure out what’s going on and work towards a fix. But it’s not enough to just jam a bunch of folks into a room and hope for the best. You need a framework in place to ensure that everyone stays focused, diagnoses the issue and resolves it as quickly as possible.

Learning is an iterative process #incidentmanagement #incidentresponse #sitereliabilityengineering

In this clip, Viktor Stanchev explains why it's important to remember that learning is an iterative process. Whether you’re a seasoned vet when it comes to incident response, or just getting started out, it can be easy to fall into the trap of doing too much all at once. And it just makes sense. Incident response is one of those things that doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.

It's better to declare incidents early #incidentmanagement #sitereliabilityengineering

In this clip, Viktor Stanchev explains why it's better to declare incidents early rather than too late. Whether you’re a seasoned vet when it comes to incident response, or just getting started out, it can be easy to fall into the trap of doing too much all at once. And it just makes sense. Incident response is one of those things that doesn’t have a single, perfect formula, so teams can be left doing a little bit of everything in an effort to get it right.