|
By Kate Bernacchi-Sass
The world of tech is full of acronyms. SLOs are one of those that everyone talks about, but maybe not everyone fully gets. Whether you're nodding along in meetings or just hearing “SLO” for the first time, we’ve got you covered. In this post, we’ll break down what Service Level Objectives (SLOs) actually are, why they matter, and how they can help keep your systems (and your sanity) in check.
|
By Chris Evans
An Ultimate Guide to on-call schedules? You might think this sounds overly grandiose for what’s essentially putting people into a list and rotating through them. But you’d be flat-out wrong. Getting your on-call setup correct is as real and as important as it gets, and getting things wrong can lead to prolonged incidents, burnt-out employees, and a damaged company reputation.
|
By Lambert Le Manh
Data quality testing is a subset of data observability. It is the process of evaluating data to ensure it meets the necessary standards of accuracy, consistency, completeness, and reliability before it is used in business operations or analytics. This involves validating data against predefined rules and criteria, such as checking for duplicates, verifying data formats, ensuring data integrity across systems, and confirming that all required fields are populated.
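As a rough illustration of the rule-based checks described above, here is a minimal Python sketch. The field names (`id`, `email`, `signup_date`), the regular expressions, and the `check_quality` helper are all hypothetical and chosen only for the example; a real pipeline would define rules that match its own schema.

```python
import re
from collections import Counter

# Hypothetical rule set: field names and patterns are illustrative only.
REQUIRED_FIELDS = ("id", "email", "signup_date")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_quality(rows):
    """Return a list of (row_index, problem) tuples for rows that break a rule."""
    problems = []
    id_counts = Counter(row.get("id") for row in rows)
    for i, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Uniqueness: the same id must not appear on more than one row.
        if row.get("id") not in (None, "") and id_counts[row["id"]] > 1:
            problems.append((i, "duplicate id"))
        # Format: values that are present must match the expected pattern.
        if row.get("email") and not EMAIL_RE.match(row["email"]):
            problems.append((i, "bad email format"))
        if row.get("signup_date") and not DATE_RE.match(row["signup_date"]):
            problems.append((i, "bad date format"))
    return problems


rows = [
    {"id": 1, "email": "a@example.com", "signup_date": "2024-07-01"},
    {"id": 1, "email": "b@example.com", "signup_date": "2024-07-02"},
    {"id": 2, "email": "not-an-email", "signup_date": ""},
]
for index, problem in check_quality(rows):
    print(index, problem)
```

Returning a list of problems, rather than raising on the first failure, lets a caller decide whether to block the data, quarantine the affected rows, or simply report on them.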
|
By Charlie Kingston
Last year, we released Catalog—the connected map of everything in your organization. Catalog was built with the aim of tackling one of the most painful parts of incident response: contextualizing problems and understanding their place within your organization.
|
By Martha Lambert
At incident.io, we run an on-call product. Our customers need to be sure that when their systems go wrong, we’ll tell them about it—high availability is a core requirement for us. To achieve the level of reliability that’s essential to our customers, excellent observability (o11y) is one of the most important tools in our belt. When done right, observability improves your product experience from two angles.
|
By Ed Dean
There’s a major outage. Support tickets are mounting. Everybody from engineering to legal is scrambling for information. You have more Teams notifications clamouring for attention than you do minutes to address them, and it’s hard to know where to begin. What comes next is a balancing act—mitigating the impact, updating colleagues, managing action items, or updating a status page that will be seen by millions.
|
By Rory Malcolm
With the release of On-call, our system’s reliability had to be solid from the outset. Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure. While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.
|
By Eryn Carman
We were curious: once an incident is over, how long does it take companies to document, review, create learnings, finish clean-up items, and complete any other follow-up action items? We work with a wide variety of companies, from small start-ups to enterprises with thousands of engineers. But we wanted to know: where is their time spent after they resolve an incident? Here’s what we found!
|
By Navo Das
Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with a growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.
|
By Stephen Whitworth
Today marks a particularly challenging day for incident responders across the globe. As many of you may have noticed, a recent update from CrowdStrike has triggered widespread disruptions, causing chaos in various sectors. The ripple effects have been far-reaching and severe. While the technical specifics of the issue might not be the focus here (and indeed, there are experts better suited to dissect the cause), what's crucial is understanding the impact on those who manage such crises.
|
By Incident.io
Like it or not, AI is having a monumental impact on our lives. Most of the products we engage with today have AI features and functionality, aimed at assisting or completely replacing the actions normally taken by humans. When it comes to incidents, we’re firm believers in accelerating human actions, and believe the risk of over-automation far outweighs the benefits. In this live event, we’ll dig a little deeper into why, as we cover the power and pitfalls of AI.
|
By Incident.io
In this episode, Norberto (VP of Engineering) and Lawrence (Product Engineer) delve into the recent CrowdStrike incident that began on July 19th. Rather than focus on technical specifics, they provide a thoughtful exploration of key aspects that matter to us at incident.io, such as effective communication, overall response strategies, and proactive problem-solving during crises.
|
By Incident.io
Gone are the days when incidents were manual to resolve, invisible to customers, and overall viewed through a negative lens. This is part two of the virtual event series, in which we dive into our fresh take on what incidents should look like, The Incident Way, and hear stories from customers putting these principles into practice.
|
By Incident.io
Scaling incident management processes can present massive challenges for an organization as large and complex as Netflix. And for Netflix, whose brand has become synonymous with dependability, there’s a lot at stake. Since its introduction to a specific set of Netflix teams, incident.io has been organically adopted far and wide across Netflix Engineering, highlighting just how indispensable and impactful the tool has become.
|
By Incident.io
During a recent episode of The Debrief, we spoke with Jeff Forde, Architect on the Platform Engineering team at Collectors, about building an incident management program at various stages of growth. In that episode, we called it growth from zero to one, one to two, and two to three. But what happens once you’ve scaled beyond three, and answers to the questions you may have become that much harder to find?
|
By Incident.io
In this event, uncover the common pains associated with legacy incident management norms and why they fall short of modern needs.
|
By Incident.io
This week, we have a really fun conversation lined up. For this episode, we chatted with Toby Jackson, Global SRE Team Lead at Future, about why it’s a bad idea to take a cookie-cutter approach to incident management or, put another way, why it’s not a good idea to treat all incidents alike. In our conversation, we discuss what’s wrong with this approach, some situations where this might actually make sense, how psychological safety factors into this conversation, and a whole lot more.
|
By Incident.io
In this clip, Pete explains why we've taken the approach of "exoskeletons, not robots" when building with AI. It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles?
|
By Incident.io
It’s fair to say that AI is here to stay. So, as companies grapple with this reality, they’re putting their best foot forward to build AI features that really make a difference for their customers. But should you be building these features if there’s no obvious fit in your product? And even if there is, are you making sure to stay true to your product principles? The reality is that deciding to build AI into your product isn’t a decision you make on a whim.
Create, manage and resolve incidents directly in Slack. Leave the admin and reporting to us.
Improving your incident response, visibility, and ability to learn:
- Less faffing, more fixing: We take care of the admin during incidents, so you can save your brainpower for the decisions that matter.
- Divide and conquer: We make sure everyone’s role is clear, track who’s working on what, and help you escalate if you need extra help.
- Get up to speed, at speed: Get everyone on the same page from the moment they join the incident, and help stakeholders stay in the loop.
- Timelines, in no time: Constructing an incident timeline for review is important, but time-consuming. We’ll build one for you in real time, and keep it constantly up to date.
- Data and insights you can trust: You’ve already paid for your incidents. By surfacing the data you need to make decisions, we help you get your money’s worth.
Incident response for your whole organisation.