
The latest News and Information on Service Reliability Engineering and related technologies.

Introducing Squadcast Premium

For the last few years, Squadcast has been building out a market-leading on-call and alert management solution. Over the past few quarters, we have significantly enhanced our on-call product by releasing and improving features related to Incident Response - including Slack / MS Teams integration, Runbooks, Postmortems, Service Level Objectives, and Status Pages. We believe that a reliability platform involves both on-call and incident response - one cannot work effectively without the other.

DevOps vs. SRE: What's the Difference?

Despite there being significant differences in the roles, DevOps and Site Reliability Engineering are often lumped together because many people assume they do similar work. Although both attempt to reduce the issues arising from software development processes, their goals, skill sets, and approaches are actually quite different. DevOps engineers focus on the development pipeline, and their goal is to enable better development processes and workflows.

Service Catalog: Simplifying Service Management and Ownership

With the adoption of cloud and microservices, modern IT infrastructures operate with a mesh of services that cater to multiple user requirements. It can get very difficult to simultaneously keep track of numerous services. A Service Catalog helps organize service-related information in a single pane, achieve end-to-end service ownership and get real-time performance insights.
Sponsored Post

Exploring PagerDuty Alternatives for Incident Response

Incident response refers to effectively responding to infrastructure issues and resolving them in the shortest time frame possible. Due to several costly, high-profile outages over the last few years, organizations have sought to create rigorous processes with specialized tools to resolve incidents quickly and learn from their failures. As one of the first platforms to enter the incident response space, PagerDuty is a dominant player, but over the years, competing platforms have begun carving out their own niches.

Sponsored Post

The Importance of Observability for Site Reliability Engineers (SREs)

Site reliability engineers (SREs) play a crucial role in ensuring the reliability of systems. From creating software to improving system reliability in production, responding to incidents, and fixing issues, SREs are responsible for guaranteeing the health of applications. Observability supports SREs in this work, because an observable system allows them to identify and fix issues promptly, leaving SREs better equipped to fast-track development cycles.

Tips to make your Retrospectives Meaningful

If done right, retrospectives can help you inspect past actions, help adapt to future requirements and guide teams towards continuous improvement. However, organizations find it difficult to adopt the right mindset to execute retrospectives effectively. This blog will help you understand what retrospectives are and provide valuable tips to make your retrospectives meaningful. This blog will cover,

Introducing Webforms - Involve end users directly in your Incident Management process

Over the years, we’ve received requests from our customers for a feature that enables their customers and end users to create or report incidents directly on Squadcast. To our valued customers - we heard you! We are excited to introduce Webforms to do exactly that. In the past, we’ve addressed the challenges pertaining to on-call processes and best practices that teams can implement.

Managing Squadcast resources with our expanded Terraform provider

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, our Terraform provider was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

Using Observability with Kubernetes to Automate Site Reliability Engineering

In this video, Anthony Evans, solution architect, explains how the StackState topology-powered observability platform can help SREs automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms and, with AIOps capabilities, even prevent problems proactively. It's all here!

What is a Security Operation Center and how do SOC teams work?

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard them from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify and prevent cyber threats to safeguard your IT environments. This blog covers the following pointers.