The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
For this episode we’re continuing to “Build Things on Purpose” with JJ Tang, co-founder of Rootly, who joins us to talk about incident response, the tool he’s built, and his many lessons learned from incidents. Rootly is aiming to automate some of the more tedious work around incidents, and keeping that consistency. JJ chats about why he and his co-founder built Rootly, and the problems they’re trying to fix and eliminate when it comes to reliability.
Forget the latest tech gadgets and the newest products. One of the most talked about trends in observability right now? “SLOs have really become a buzzword, and everyone wants them,” said Grafana Labs principal software engineer Björn “Beorn” Rabenstein on a recent episode of “Grafana’s Big Tent,” our new podcast about people, community, tech, and tools around observability.
Your IT team just finished resolving a complex incident, customer service finished their last call about the issue, and your business is back to being fully operational. Now that the storm has passed, you should be planning a postmortem to determine the cause of the incident and lessons learned. Postmortems require specific data that can highlight where your team is succeeding and where they can improve.
As the Regional VP of Customer Success for the West and Central Region at BigPanda, Chris LaPierre gets a unique opportunity to see first-hand how BigPanda customers use their AIOps platform. Charged with ensuring every BigPanda customer derives high value and return on investment from the solution, BigPanda’s customer success teams make certain customers leverage the AIOps platform to increase their bottom line.
By now it’s no secret that system outages and website downtime are more widespread and frequent than ever. In fact, the frequency of outages jumped 9% in just the first week of 2022. This can be attributed to a rapid increase in traffic and reliance on tech infrastructures – resulting in connectivity, server, and other technical issues that are alternately unforeseen and unavoidable.
Modern DevOps teams that run dynamic, ephemeral environments (e.g., serverless) often struggle to keep up with the ever-increasing volume of logs, making it even more difficult to ensure that engineers can effectively troubleshoot incidents. During an incident, the trial-and-error process of finding and confirming which logs are relevant to your investigation can be time consuming and laborious. This results in employee frustration, degraded performance for customers, and lost revenue.
To embed or not to embed: That is the question. At least, that’s one of the questions that companies have to answer as they decide how to implement Site Reliability Engineering. They can either embed SREs into existing teams, or they can build a new, separate SRE team. Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.