SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Building on Chaos Toolkit's Foundation: New Features for Resilience Engineering

On October 26th, 2023, we had the pleasure of hosting Manuel Castellin, a seasoned expert in chaos engineering and Terraform, who took us through two real-world examples demonstrating how to overcome the challenges of implementing chaos engineering when your infrastructure isn’t initially prepared for it, and how to experiment securely on production systems. In the second part of the meetup, Sylvain Hellegouarch, Chaos Toolkit lead developer and Reliably CEO, gave a quick demo of how to use Reliably to build your experiments in a less code-centric, more visual way.

Introducing Squadcast's Global Event Rulesets | Incident Management | Squadcast

This video gives you a walkthrough of Squadcast's new feature, 'Global Event Rulesets', which helps you simplify alert routing and boost efficiency. Global Event Rulesets enable you to manage alert routing across services and automate actions based on predefined rules.

Secret to Flawless Deployments: Real-Time Canary Deployment tracking with Argo CD & Levitate!

Most of your outages are probably caused by a change, and having observability around changes makes a big difference. Dive into this walkthrough, where we showcase tracking canary deployments in Argo CD and correlating events and metrics seamlessly with Levitate. It is aimed at Site Reliability Engineers, DevOps engineers, Software Engineers, and Product Managers seeking to improve their observability and ensure smooth deployments every time.

Tips To Never Miss An Incident Notification With Squadcast Escalation Policies

Companies implement an Incident Response process to promptly resolve critical issues. Setting up escalation policies to notify engineers is a key step in this process. With traditional escalation policies, alert notifications still get missed, which results in longer response times and failure to meet SLAs. So, how can one ensure incident notifications are never missed?

Opsgenie Alternatives: Finding the Right Fit for your Incident Management Teams

In the dynamic landscape of modern IT operations and Incident Management, choosing the right tool is paramount to ensuring the resilience of your organization. Opsgenie, a popular Incident Response and Alerting platform, has been a go-to choice for many. However, as businesses grow and requirements evolve, exploring Opsgenie alternatives becomes essential in the quest to find the perfect fit for your unique operational needs. In this blog, we'll embark on a journey to uncover and evaluate some compelling alternatives to Opsgenie, helping you navigate the vast sea of options and make an informed decision that aligns perfectly with your team's workflows and objectives.

Webinar: Streamlining Incident Management With Automation and Contextual Awareness

In the modern context of distributed teams and complex digital infrastructure, major incidents that affect multiple teams and services can cause a barrage of alerts. While a meticulously designed incident response strategy can help restore order, it is essential to give responders effective tools that offer contextual understanding and make it easier to identify actionable alerts.

MSPs As NOCs: Handling Multiple Clients

A Managed Service Provider (MSP) should invest in an Incident Management platform to ensure seamless service delivery and customer satisfaction. Such a platform streamlines Incident Response, improves service reliability, and enhances communication among teams. It helps MSPs reduce Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR), thereby minimizing downtime and service disruptions.

Elevating Incident Management: Leveraging automation and AI to put reliability on autopilot

If your company operates in a modern digital environment, then there’s a good chance questionable reliability is hurting you competitively. On the other hand, every hour your engineering team spends on operations comes at the expense of developing your product. So, what are you supposed to do?

RapidSpike + Squadcast: Routing Alerts Made Easy

RapidSpike is a website monitoring solution that focuses on all three key aspects of website health: performance, reliability and security in a single dashboard. If you use RapidSpike for your website monitoring requirements, you can integrate it with Squadcast, an end-to-end Incident Response tool, to route alerts from RapidSpike to the right users in Squadcast with ease.