
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Mattermost Playbooks How-to: Release Management

Releasing software to users has become a sophisticated and intricate process that requires high levels of consistency and coordination. A release has to be built, brought together, documented, tested and deployed, which takes coordination among at least four separate teams and a generous handful of pipelines and other tools. Without a well-documented process, things can get messy very quickly, causing stress for everyone involved.

Mattermost Playbooks How-to: Incident Resolution

Whether you’re part of a team managing SaaS products or a high-security digital workspace, sometimes Things Go Wrong and must be addressed with extreme care, professionalism, and predictability. For outages, data breaches, vulnerabilities and more, you and your team are juggling a variety of tools, processes, and rigid incident management systems. When the on-call pager goes off at 3 am, almost no one can remember every step needed to kick off all the response workflows.

The Cost of Downtime: How Much Does an IT Outage Cost Your Business?

Life in the world of managed IT services is not without its unpleasant surprises. Although we’re an industry of system builders dedicated to facilitating the smoothest operations possible, downtime still happens. An unexpected system or network failure is not uncommon. In fact, it's inevitable. Even some of the world’s biggest companies can’t escape painful outages.

Four key takeaways from our recent webinar: BigPanda picks up where Netcool left off

For years, Netcool has been omnipresent in many IT Operations organizations. That, combined with the sheer utility it once brought to the table, sometimes gave it a special sort of nostalgic reverence in IT Operations circles. But with all due respect to Netcool, there’s also little doubt the platform’s real-world utility has waned in the era of cloud and hybrid ops.

Introducing Grafana OnCall OSS, on-call management for the open source community

Last November, we announced the launch of Grafana OnCall, an easy-to-use on-call management tool that helps reduce toil through simpler workflows and interfaces tailored for developers. Born out of Grafana Labs' acquisition of Amixr Inc., Grafana OnCall began as a cloud-only solution that became generally available to all Grafana Cloud users, on both paid and free plans, in February.

5 Ways to Reduce IT Incidents Before Your Team Succumbs to the Ticket Backlog

If you talk to any Service Desk agent, they will agree there has been an explosion in IT tickets since the transition to remote and hybrid work. Even now, growing challenges keep them from reducing IT incidents. Average ticket volume has risen by 16% since the pandemic, stressing already overtaxed help desk agents. This increase in tickets has led to wasted resources, poor IT service delivery and frustrated employees.

Squadcast Product Demo | Incident Management | On-call | SRE | Status Page | SLO Tracker | Runbooks

This video explains why Squadcast is a feature-rich solution for SRE, DevOps, and Engineering teams in general. With the ability to help teams quickly mobilize response teams during critical incidents, easily manage on-call schedules, and track SLOs for better SRE, Squadcast is a multi-purpose platform with numerous capabilities. This short video covers everything the product is capable of.

Setting up Route 53 Health Checks

We live in an age where the internet and digital data drive modern-day markets, which results in huge amounts of data being generated and consumed. Hence, it has become very important for online platforms to manage this traffic and serve their customers more efficiently. In this blog post, we will explore the Amazon Route 53 service and see how it addresses domain name system routing and health check problems.