November 2023

How to Route Alerts to Subject Matter Experts Using Squadcast Tagging & Routing Rules?

Nov 30, 2023 By Chitra Bisht In Squadcast

Effective Incident Management is crucial for ensuring customer satisfaction and brand loyalty. As systems grow more complex, efficiently directing alerts to the right teams becomes essential. This article delves into the challenges, implementation, and benefits of automating incident categorization.

Read Post

How to improve your IT alert management: Understanding best practices

Nov 30, 2023 By Amy Brennen In BigPanda

As an IT leader, you’re under significant pressure to control the constant alerts. Somehow, you must manage non-stop IT alerts while also ensuring ultra-high service availability. The task is far from easy, and even the most sophisticated teams struggle to keep up and turn alerts into action with tech stacks that are constantly growing in size and complexity. IT alert management is the first line of defense.

Read Post

Your guide to better incident status pages

Nov 30, 2023 By Jouhné Scott In FireHydrant

Your status page (or lack thereof) has the opportunity to signal a lot about your brand — how transparent you are, how quickly you respond to incidents, how you communicate with your customers — and ultimately, this all seriously impacts your reliability. After all, as our CEO Robert put it in a recent interview on the SRE Path podcast, you don’t get to decide your reliability; your customers do.

Read Post

What is Incident Management? Unpacking the Complexity

Nov 30, 2023 By Sirine Karray In iLert

In the increasingly digital world, tech-savvy professionals strive to maintain reliable and efficient operations that ensure customer satisfaction and uphold trust. Incident Management is an essential component in achieving those goals. This article delves into the complexities of Incident Management, highlighting essential tools and processes that contribute to effective response and resolution strategies.

Read Post

Announcing the StatusCast Mobile App: A Game-Changer for Status Page Users

Nov 30, 2023 By StatusCast In StatusCast

We are thrilled to introduce the latest innovation from StatusCast: our groundbreaking mobile status page application, which will be available on both Android and iOS platforms. This launch marks a significant milestone in the evolution of status page accessibility, offering unparalleled convenience and functionality to your power users, the subscribers.

Read Post

Everbridge Webinar: Increased Terrorism Risks during the Holiday Season

Nov 30, 2023 By Everbridge In Everbridge

Watch as Director of CEM Product Marketing Sean McDevitt and EMEA Risk Intelligence Regional Analyst James Burr discuss the increased risks at large public gatherings throughout Europe during the holiday season. They will also showcase key functionality Everbridge customers may utilize to keep their employees safe during the next several weeks.

View Video

#5 Rundeck by Pagerduty Community Meetup: Automate Kubernetes w/ Rundeck (Part 3)

Nov 30, 2023 By PagerDuty In PagerDuty

Session III: Automate Kubernetes with Rundeck. Speaker: Justyn Robberts, Sr. Solutions Consultant @ PagerDuty. Get together with the Rundeck by PagerDuty Process Automation crew in this 5th Community Meetup and learn how automation is leading La Sapienza University of Rome and Application Performance's way to innovation and fast-tracking business for the future.

View Video

Navigating the New SEC Data Breach Rule: A Blameless Blueprint for Compliance

Nov 29, 2023 By Blameless In Blameless

The new SEC rule on material security breaches goes into effect on December 18, 2023 for larger publicly traded companies and all other public companies within 180 days. If you're not already in compliance, it’s important for you to prepare for the new rule now by developing a plan for incident response and disclosure.

View Video

Incidents are inevitable, but chaos is optional.

Nov 29, 2023 By StatusCast In StatusCast

Ever wondered how to navigate through unexpected challenges without succumbing to chaos? Our short video explores the art of managing incidents effectively, showcasing practical strategies to keep chaos at bay. Dive into insightful tips and real-world examples that demonstrate how proactive planning and a resilient mindset can turn potential chaos into an opportunity for growth. Whether you're a business professional, student, or someone eager to enhance their problem-solving skills.

View Video

Are you down?

Nov 29, 2023 By StatusCast In StatusCast

Discover the power of streamlined communication with StatusCast as we delve into how our platform can revolutionize the way you handle incidents and keep everyone on the same page. From status updates to incident resolution, this video is your gateway to seamless collaboration. Dive into real-world scenarios showcasing how Statuscast.com ensures that everyone stays informed, minimizing downtime and maximizing productivity. Learn how to turn potential setbacks into opportunities for growth with our intuitive platform.

View Video

What is ServiceNow change management - and how does AIOps optimize it?

Nov 29, 2023 By Amy Brennen In BigPanda

Effective IT change management is essential for maintaining smooth operations in today’s fast-paced, agile IT environment. Given that 85%, or the vast majority, of incident-impacting alerts result from changes, optimizing your change management means improving your incident management and ensuring critical system reliability. So whether your organization uses ServiceNow for change management or is considering using ServiceNow, we’ll walk you through everything you need to know.

Read Post

PagerDuty Named a Leader in GigaOm's Inaugural 2023 Incident Response Platforms Radar Evaluation

Nov 29, 2023 By Zack Linky In PagerDuty

In a world where organizations of all industries increasingly rely on digital innovation and experiences to create differentiation in the market, it has never been more critical to ensure the integrity of their operations is safeguarded against unforeseen outages and incidents. Operational disruptions today can have a major impact on brand reputation, create negative revenue implications and impact customer loyalty.

Read Post

Unlocking Visibility and Control: Introducing Squadcast's Service Graph Feature

Nov 28, 2023 By Vishal Padghan In Squadcast

To ensure efficient Incident Management, it is crucial to proactively anticipate and address potential disruptions. The need for a comprehensive, high-level view of the status of all services is paramount. Enter Squadcast's Service Graph – a feature designed to transform the way organizations approach Incident Management.

Read Post

Comparing the Top 9 Pagerduty Alternatives in 2023

Nov 28, 2023 By Abhishek Sony In Squadcast

PagerDuty is a popular Incident Management platform that helps teams respond to alerts and incidents quickly and efficiently. However, its pricing structure can be complex and expensive for scaling businesses and Incident Response teams. In this blog post, we will compare the top 9 PagerDuty alternatives in 2023 and help you choose the best one for your needs.

Read Post

Engineering nits: Building a Storybook for Slack Block Kit

Nov 28, 2023 By Lawrence Jones In Incident.io

We care a lot about the pace of shipping at incident.io: moving fast is a fundamental part of our company culture, and out-pacing your competition is one of the best ways we know to win. In engineering teams, one way to ship fast is to invest in tools that make your team more productive. We've become good at identifying small pains and frustrations that slow us down over time and – after surfacing them to the rest of the team – find solutions for them.

Read Post

Build Operational Resilience with Generative AI and Automation

Nov 28, 2023 By Ariel Russo In PagerDuty

For modern enterprises aiming to innovate faster, gain efficiency, and mitigate the risk of failure, operational resilience has become a key competitive differentiator. But growing complexity, noisy systems, and siloed infrastructure have created fragility in today’s IT operations, making the task of building resilient operations increasingly challenging.

Read Post

Automate insights-rich incident summaries with generative AI

Nov 28, 2023 By Noam Morginstin In Exigence

Does this sound familiar? The incident has just been resolved and management is putting on a lot of pressure. They want to understand what happened and why. Now. They want to make sure customers and internal stakeholders get updated about what happened and how it was resolved. ASAP. But putting together all the needed information about the why, how, when, and who, can take weeks. Still, people are calling and writing. Nonstop.

Read Post

What is PagerDuty - and how does it work with BigPanda?

Nov 28, 2023 By Amy Brennen In BigPanda

PagerDuty is an IT operations management platform and cloud computing company launched in 2009. They provide a suite of tools designed to help IT and DevOps teams detect and respond to infrastructure problems, streamline workflows, and improve operational reliability. The PagerDuty platform bridges different systems and the teams that maintain them, centralizing the detection and reporting of incidents. It allows organizations to minimize downtime and resolve issues efficiently.

Read Post

Managing Databases on AWS: A Practical Guide

Nov 28, 2023 By Ritika Bramhe In OnPage

Amazon Web Services (AWS) provides a range of managed database services that provide multiple database technologies to handle various use cases. They are designed to free businesses from tasks like database administration, maintenance, upgrades, and backup. AWS databases come in several types to cater to different business needs.

Read Post

Top 5 Incident Response Tools to Watch Out for in 2024

Nov 27, 2023 By Chitra Bisht In Squadcast

Having effective incident response tools is crucial for IT organizations. Your incident response process improves when you are equipped with the appropriate tool, one that includes intelligent features tailored to your needs. Whether you're just beginning your venture into efficient Incident Management or in search of the finest incident response tools, we present the top five options for your consideration.

Read Post

Build custom monitoring and remediation tools with the Datadog App Builder

Nov 27, 2023 By Thomas Sobolik In Datadog

When you’re responding to an issue with your application in the heat of on-call, you need reliable, well-maintained tooling that’s painless to use. Otherwise, the time you’ll spend combing through monitoring data for context, connecting to hosts and other infrastructure resources, and pivoting between consoles for various managed services can add up quickly and slow your response.

Read Post

Top SRE Tools for Enhanced Site Reliability

Nov 27, 2023 By Anjali Udasi In Zenduty

Site Reliability Engineering (SRE) stands out as a crucial discipline, ensuring the smooth operation and scalability of intricate software systems. SREs employ a diverse toolkit, automating tasks, monitoring system health, and proactively tackling potential issues. The goal? To elevate site reliability and keep downtime at bay. In this blog, we'll dive deep into the realm of SRE tools, breaking down what each tool brings to the table.

Read Post

Your incident declaration form is (probably) too long: The power of concise reporting

Nov 27, 2023 By Matilda Hultgren In Incident.io

It’s 10am, your coffee is ready and piping hot, and you have just been paged. Looks like is down, and customers are starting to notice. With no time to lose, you open up your organization’s incident declaration form and you spend the next thirty minutes filling out the fifteen required fields, while the incident grows bigger and more complex, messages are rolling in, and your coffee grows cold.

Read Post

PagerDuty Copilot | Generative AI for PagerDuty Operations Cloud

Nov 27, 2023 By PagerDuty In PagerDuty

Introducing PagerDuty Copilot: Your GenAI assistant for critical operations work. For scaling your teams. For sustaining customer experiences. For moving business forward – faster. Work more efficiently. Protect more revenue. Build greater operational resilience. PagerDuty Copilot is the AI assistant operations teams trust to help them manage business-impacting issues in seconds, not hours. From event to resolution, PagerDuty Copilot’s automations help you resolve issues faster, reduce risk, and control costs.

View Video

Improving Customer Support with Squadcast Webforms: A Smart Solution for MSPs

Nov 24, 2023 By Chitra Bisht In Squadcast

Managed Service Providers (MSPs) handle a multitude of customer support cases, each requiring efficient routing to the right team member. Squadcast's Webforms provide a solution to expedite issue reporting and streamline resolution. In this blog, we will explore how MSPs can leverage webforms to enhance the customer support experience.

Read Post

Introducing Workflows: Enhancing Automation in Incident Response

Nov 23, 2023 By Sanjog Sandhu In Squadcast

At Squadcast, we advocate for the principles of Site Reliability Engineering (SRE), which emphasize the critical importance of automating routine tasks to boost efficiency in Incident Management. We're aiding organizations in implementing these principles with one of our newest features: 'Workflows'. Workflows has been designed to automate manual facets of your Incident lifecycle, all while ensuring human-in-the-loop execution for critical decisions.

Read Post

Terraform Time | Let's move on with PagerDuty Incident Workflows for Incident Response automation

Nov 23, 2023 By PagerDuty In PagerDuty

Building on last week's Terraform Time episode, let's continue our journey to automate Incident Response tasks using PagerDuty Incident Workflows.

View Video

Introducing Squadcast Workflows | Automating Incident Response | Squadcast

Nov 23, 2023 By Squadcast In Squadcast

This video introduces you to Squadcast Workflows, a new feature that lets you effortlessly automate repetitive tasks during an incident, allowing your team to focus on Incident Resolution.

View Video

Best Practices to Avoid Website Outages on Black Friday

Nov 22, 2023 By OnPage Corporation In OnPage

The most frenzied shopping day of the year – Black Friday – is fast approaching, and businesses around the globe are bracing themselves. However, imagine this – a massive number of eager shoppers ready to snag the hottest deal, and just when your website should be working at its best, it crashes, leaving behind frustrated customers and potential revenue slipping through your virtual fingers. This scenario is not entirely fictional.

Read Post

Resilience Engineering in 2024: Challenges, Trends, & Priorities

Nov 22, 2023 By Incident.io In Incident.io

Is your organization ready to fortify, expand, and cultivate a robust resilience engineering culture in 2024? In this webinar, Chris Evans (Co-founder & Chief Product Officer, incident.io) and Courtney Nash (Internet Incident Librarian, The VOID) will delve into crucial considerations and top priorities for improving your organization’s ability to build safer and more reliable complex systems while unlocking insights for shaping your plans for 2024 and beyond.

View Video

Quick start guide to Unified Analytics dashboards

Nov 21, 2023 By Olivia Sell In BigPanda

When it comes to observability, we’ve found that most organizations have ~20 tools installed in their IT environments. With so many tools, it’s difficult for IT leaders to gain insight into how their tools are performing and determine how much value ITOps is bringing to the organization.

Read Post

Weathering Black Friday and Other Storms Reliably

Nov 21, 2023 By Emily Arnott In Blameless

If you work in eCommerce, you can see the storm on the horizon. Black Friday, the biggest shopping day of the year both online and off, is only a few days away. Your services are going to hit usage spikes you possibly have never seen before. And it will be all aspects of your services pushed to your limit – people won’t just be searching, or just buying, or signing up for programs, they’ll be doing all of these at once. Most crucially, everyone else is offering deals too.

Read Post

Should data teams consider incident management tools to respond to pipeline issues?

Nov 21, 2023 By Jack Colsey In Incident.io

Data teams are adopting more processes and tools that align with software engineering, and from talks at the dbt Coalesce conference in 2023, there’s clearly a big push towards adopting software engineering practices at enterprise scale companies. At the moment, there are a lot of tools in the data space for identifying errors in data pipelines, but no tools for responding to these errors, such as coordinating fixes. This is exactly where an incident management platform makes sense to implement.

Read Post

Guide To Best Incident Management Software

Nov 20, 2023 By Chitra Bisht In Squadcast

Avoiding downtime is imperative. Incident Management tools keep you resilient against unplanned disruptions by ensuring quick response, efficient resolution, and minimal impact on operations. This blog aims to be your go-to guide for navigating the diverse landscape of Incident Management platforms.

Read Post

Captain's Log: How we are leveraging CEL for Signals

Nov 20, 2023 By Robert Ross In FireHydrant

As engineers, we didn't want to make Signals only a replacement for what the existing incumbents do today. We've had our own gripes for years about the information architecture many old companies still force you to implement today. You should be able to send us any signal from any data source and create an alert based on some conditions. We're no strangers to building features that include conditional logic, but we upped the ante when it came to Signals.

Read Post

IAG Relies on PagerDuty Operations Cloud for Sustainable Growth

Nov 20, 2023 By PagerDuty In PagerDuty

Part of the International Airlines Group (IAG), IAG Loyalty operates the loyalty programs for IAG’s airlines—British Airways, Iberia, Vueling and Aer Lingus—and 125+ global brand partners in travel, retail, and financial services. With the PagerDuty Operations Cloud, IAG Loyalty has built a framework that allows engineers to build products and services in a fast and safe way. This has laid the foundation for sustainable growth as a company. Hear more in this video from Colin Lewis, Head of Core Engineering at IAG Loyalty and James Headon, Cloud Operations Manager at IAG Loyalty.

View Video

Tip of the Day: Resend Notifications and Set Notification Preferences

Nov 20, 2023 By StatusCast In StatusCast

Unlock the power of effective communication! Tune in to our latest Tip of the Day video on StatusCast.com, where we delve into 'Resend Notifications' and guide you on optimizing your experience by setting personalized notification preferences. Stay informed, stay empowered!

View Video

Status Pages and Incident Management for IT Enterprise

Nov 20, 2023 By StatusCast In StatusCast

Ready to revolutionize your IT Enterprise? Look no further! Explore the dynamic world of StatusCast.com, where Status Pages and Incident Management come together to redefine how you handle IT disruptions. Why StatusCast.com? StatusCast.com is not just a tool; it's your strategic partner in maintaining the health and performance of your IT systems. Our platform offers a comprehensive solution for creating informative and visually appealing status pages, ensuring your users are always in the loop during incidents.

View Video

What is tool consolidation - and how can AIOps optimize it?

Nov 20, 2023 By Amy Brennen In BigPanda

Tool consolidation is the process of analyzing which IT observability and monitoring tools to use, which to add, and which to retire. By carefully determining the usage and value of your current observability stack, your ITOps teams can consolidate redundant tools and those providing little value to reduce your operational costs. While the benefits of tool consolidation are clear, doing so is anything but.

Read Post

Tame observability complexity: Understanding the observability tool landscape

Nov 20, 2023 By Joel McKelvey In BigPanda

Choosing, deploying, maintaining, and rationalizing observability and monitoring tools can be a constant challenge for ITOps, DevOps, and SRE teams. As teams monitor increasingly complex systems, the need for instrumentation that monitors those systems grows at the same rate, leading directly to a growing problem of observability data engineering, integration, and enrichment.

Read Post

We are SOC 2 Type II Certified!

Nov 18, 2023 By StatusCast In StatusCast

Discover the essence of trust and security as we delve into Statuscast's SOC 2 certification journey. Watch to understand why this certification matters, ensuring your data's safety and reinforcing our commitment to transparency and excellence. Stay informed, stay secure.

View Video

Strengthen operational resilience with Service Chain Mapping. Watch our 60 second overview.

Nov 17, 2023 By Interlink In Interlink

Watch this short video to learn how Interlink’s Service Chain Mapping solution transforms the ability of banking and finance organizations to address regulatory demands, manage operational risk, and avoid technology failures that could disrupt key customer journeys.

View Video

Status Pages and Incident Management for SaaS Companies

Nov 17, 2023 By StatusCast In StatusCast

Explore the critical importance of status pages and incident management for SaaS companies in our latest video. Learn how effective management enhances customer trust, minimizes downtime, and ensures a resilient and successful SaaS operation. Don't miss out on valuable insights to optimize your service delivery and elevate customer satisfaction!

View Video

New Features: AI-assisted postmortems, ilert Terraform updates, and expanded ChatOps capabilities

Nov 17, 2023 By Daria Yankevich In iLert

In incident management, staying ahead of the curve is crucial, and that's what we're doing with our latest suite of features designed to streamline your workflow and enhance your response capabilities. Furthermore, you have provided numerous excellent suggestions during this period. We value your feedback and invite you to reach out to us at support@ilert.com to share your experiences with ilert.

Read Post

Incident Priority Matrix: A Comprehensive Guide

Nov 17, 2023 By Anjali Udasi In Zenduty

When multiple users are affected by an incident, it can quickly escalate into a chaotic situation. To effectively manage and prioritize such incidents, organizations need a robust incident priority matrix. An incident priority matrix is a tool organizations use to deal with critical issues quickly. It’s a roadmap for handling incidents efficiently.

Read Post

What is Vulnerability Management?

Nov 16, 2023 By Gilad Maayan In OnPage

Vulnerability management is a critical aspect of a cybersecurity strategy. It refers to the systematic and ongoing process of identifying, classifying, prioritizing, and addressing security vulnerabilities in a network environment. This proactive approach to network security aims to minimize the risk of exploitation by attackers. Vulnerability management is about staying one step ahead of potential threats.

Read Post

New Study Finds 93% of People Prefer Speaking with a Human Rather than a Chatbot

Nov 16, 2023 By Amberly Janke In PagerDuty

PagerDuty’s 2023 Holiday Shopping Report: Online shopping will be about the same as last year — top frustrations include poor digital experiences, security, shipping, and tracking issues.

Read Post

Security - A Pillar of Reliability

Nov 16, 2023 By Emily Arnott In Blameless

When you think about making your service reliable, what standards and benchmarks are most important? The availability of services? Consistently fast responses? Accurate data? Prioritizing critical and common use cases? These are all important and deserve some focus, but today we’ll put the spotlight on an often overlooked pillar: security. Cybersecurity incidents can be the most devastating types of incident for your organization.

Read Post

Terraform Time | Automate incident response using PagerDuty Incident Workflows

Nov 16, 2023 By PagerDuty In PagerDuty

We will be delving into automating common incident response tasks using PagerDuty Incident Workflows via Terraform.

View Video

Updates to PagerDuty Analytics

Nov 16, 2023 By PagerDuty In PagerDuty

Senior Product Manager Anojan Gunasekaran joins us to chat about what's new in PagerDuty's Analytics!

View Video

Unleash the potential of intelligent, context-aware automation with BigPanda and Ansible

Nov 16, 2023 By Adam Blau In BigPanda

Many ITOps organizations we speak with want a state of self-healing systems capable of identifying and resolving issues without human intervention. Thanks to the progress in AI and ML, AIOps has made significant advancements in areas that automate many of the steps involved with identifying and triaging incidents. We ask ITOps leaders why they aren’t taking the next step with auto-remediating incident response workflows.

Read Post

Status Pages and Incident Management for Higher Education

Nov 16, 2023 By StatusCast In StatusCast

Elevate your higher education experience with StatusCast! Watch our exclusive system outage video to discover crucial insights and proactive strategies to ensure uninterrupted operations in the dynamic landscape of academia. Learn from real-life scenarios and gain valuable knowledge on maintaining system reliability, minimizing downtime, and enhancing the overall efficiency of your educational institution. Stay ahead in the digital age of higher education with StatusCast – because your institution's success depends on a robust and resilient IT infrastructure!

View Video

Incident communication best practices for an elevated user experience

Nov 15, 2023 By Arun Madhavan In Site24x7

Downtime is unavoidable, and incidents happen. Organizations need to be rapid and transparent in communicating incidents with their customers. Lack of timely communication can jeopardize the entire incident management process and increase user frustration. This guide provides rich insights into what incident communication is, why it's important, and best practices for effective incident management. What is an incident, and why is incident communication important?

Read Post

Understanding intelligent alerts in ITOps and alert management best practices

Nov 15, 2023 By Amy Brennen In BigPanda

As an ITOps leader, you know managing enterprise IT can be challenging, with its mix of old and new, on-site and cloud-based systems. Closely monitoring each part of the system infrastructure and its many components is a constant struggle, forcing you and your team to juggle non-stop alerts and keep services up and running. How can you stop alert fatigue and gain clarity when alerts are incessant, unclear, and lack the necessary context? The answer lies in intelligent alerts.

Read Post

BigPanda

Read more about Understanding intelligent alerts in ITOps and alert management best practices

Tip of The Day : How to Best Use Incident Templates

Nov 14, 2023 By StatusCast In StatusCast

Welcome to Statuscast.com's latest video: "How to Best Use Incident Templates," hosted by our very own Director of Customer Experience Engineering! In this power-packed tutorial, Denise Joyal will guide you through the intricacies of optimizing your incident response using Statuscast's cutting-edge Incident Templates feature.

View Video

StatusCast

Read more about Tip of The Day : How to Best Use Incident Templates

Incident management really can be for everyone

Nov 14, 2023 By incident.io In Incident.io

Incident management tools are often built for engineers to solve technical issues. On the surface, thinking of incident management as an engineering problem makes sense, and it’s an approach that’s widely used by many organizations from small startups to large enterprises. When there's a problem like a checkout page failure or a server crash, it’s natural for engineers to spring into action, declaring and resolving these incidents.

Read Post

Incident.io

Read more about Incident management really can be for everyone

From Chaos to Actionable Insights with PagerDuty Integrations and Automation

Nov 14, 2023 By Tiago Barbosa In PagerDuty

It’s 2023. In today’s world, every company and individual, regardless of their industry, relies on software to increase productivity. Our users expect our technology to be available and reliable at all times. If your software serves businesses within a single country during regular working hours, they expect it to be available throughout that time. Easy, right?

Read Post

PagerDuty

Read more about From Chaos to Actionable Insights with PagerDuty Integrations and Automation

ilert ChatOps: Create alerts in Slack

Nov 14, 2023 By iLert In iLert

ilert integration for Slack lets you create alerts directly within Slack and streamlines your incident management process, making it even easier for your team and stakeholders to report incidents.

View Video

iLert

Read more about ilert ChatOps: Create alerts in Slack

A tool rationalization head start with BigPanda

Nov 14, 2023 By Nathan Bao In BigPanda

Tool rationalization, sometimes called tool consolidation, is the systematic analysis of observability and monitoring tools, the consideration of onboarding new tools to fill gaps, and the retirement of unnecessary tools. Perhaps you and your IT team are struggling with constantly buying new tools to meet a very niche use case to unlock new capabilities.

Read Post

BigPanda

Read more about A tool rationalization head start with BigPanda

Introducing Workflows: Enhancing Automation to Incident Response

Nov 13, 2023 By Sanjog Sandhu In Squadcast

Read Post

Squadcast

Read more about Introducing Workflows: Enhancing Automation to Incident Response

What is ServiceNow IT Operations Management - and how does it work with AIOps?

Nov 13, 2023 By Adam Blau In BigPanda

Is your company using ServiceNow IT Operations Management or considering using it? If so, you know the importance of enhancing the visibility of your IT infrastructure and services, protecting against service disruptions, and enhancing your company’s operational flexibility. In this blog, we’ll discuss how ServiceNow ITOM works, improves visibility across the entire IT infrastructure, and streamlines operations. We’ll also discuss how ServiceNow ITOM is better together with AIOps.

Read Post

BigPanda

Read more about What is ServiceNow IT Operations Management - and how does it work with AIOps?

7 Habits of Successful Generative AI Adopters

Nov 13, 2023 By Dormain Drewitz In PagerDuty

Generative AI is forecasted to have a massive impact on the economy. These headlines are driving software teams to rapidly consider how they can incorporate generative AI into their software, or risk falling behind in a sea-change of disruption. But in the froth of a disruptive technology, there’s also high risk of wasted investment and lost customer trust.

Read Post

PagerDuty

Read more about 7 Habits of Successful Generative AI Adopters

Keeping Stakeholders Notified of Incidents With Squadcast

Nov 10, 2023 By Chitra Bisht In Squadcast

How can stakeholders such as the CEO (Chief Executive Officer), CTO (Chief Technology Officer), and COO (Chief Operating Officer), as well as other business units such as Sales and Support, be kept in the loop while managing a critical incident?

Read Post

Squadcast

Read more about Keeping Stakeholders Notified of Incidents With Squadcast

OnPage Releases Healthcare-Focused Slack Integration

Nov 10, 2023 By Ritika Bramhe In OnPage

In the healthcare realm, the need for communication platforms that meet HIPAA standards is undeniable. Enter Slack, a popular collaboration platform armed with robust security features. However, the real game-changer emerges through the integration with OnPage. This isn’t just an upgrade in collaboration; it’s a transformative shift in critical communication within healthcare—a field where every moment counts.

Read Post

OnPage

Read more about OnPage Releases Healthcare-Focused Slack Integration

The Unplanned Show E20: LLM Observability w/Charity Majors & James Governor

Nov 10, 2023 By PagerDuty In PagerDuty

Large language models (LLMs) are foundational to generative AI capabilities, but present new challenges from an observability perspective. Hear from observability thought leader and CTO/co-founder of Honeycomb, Charity Majors, and developer-focused analyst and co-founder of Redmonk, James Governor, in this discussion about LLM observability as more organizations build business-critical features on LLMs.

View Video

PagerDuty

Read more about The Unplanned Show E20: LLM Observability w/Charity Majors & James Governor

The Unplanned Show, Ep. 21: Cultures of Automation with Jamie Vernon

Nov 10, 2023 By PagerDuty In PagerDuty

Automation (and AI) change the way teams work, which requires cultural change. Hear from Jamie Vernon on how he's built a culture of automation to improve operational efficiency and team satisfaction.

View Video

PagerDuty

Read more about The Unplanned Show, Ep. 21: Cultures of Automation with Jamie Vernon

How to Reduce MTTR: A Complete Guide

Nov 9, 2023 By Bhavyadeep Sinh Rathod In Motadata

Organizations striving to improve their operational efficiency must know how to reduce MTTR, as it plays a key role in today’s fiercely competitive business landscape. Customer satisfaction is a top priority for most businesses, and a late response to customer queries or issues can have a negative impact. To track response and resolution time, businesses measure their MTTR score. MTTR is a key metric that gives insight into how much time an organization takes to resolve an incident or issue.

Read Post

Motadata

Read more about How to Reduce MTTR: A Complete Guide

Terraform Time - Enrich PagerDuty incident response leveraging Custom Fields

Nov 9, 2023 By PagerDuty In PagerDuty

We will use Terraform Custom Field resources to enrich PagerDuty incident response tasks.

View Video

PagerDuty

Read more about Terraform Time - Enrich PagerDuty incident response leveraging Custom Fields

How observability and AIOps work better together

Nov 9, 2023 By Conor Castronovo In BigPanda

If you’re juggling complex, cloud-based, containerized systems and aiming to meet high customer expectations, your old monitoring processes probably don’t cut it anymore. Increasing infrastructure complexity means you need to instrument more, log more, and monitor more. That leads to even more complexity. The answer is better observability, right? Yes and no. Observability and monitoring are critical, but they are only part of what you need for service awareness and availability.

Read Post

BigPanda

Read more about How observability and AIOps work better together

Captains Log: A first look at our architecture for Signals

Nov 8, 2023 By Robert Ross In FireHydrant

Welcome to the first Signals Captain’s Log! My name is Robert, and I’m a recovering on-call engineer and the CEO of FireHydrant. When we started our journey of building Signals, a viable replacement for PagerDuty, OpsGenie, etc., we decided very early that we would tell everyone what makes Signals unique, and what better way than to tell you how we’re building it (without revealing too much 😉). Let’s jump in.

Read Post

FireHydrant

Read more about Captains Log: A first look at our architecture for Signals

The New SEC Rules and You

Nov 8, 2023 By Emily Arnott In Blameless

The Securities and Exchange Commission published new rules for SEC registrants around disclosing incident details and response policies. Compliance with these new rules should be top of mind for any company – even if your org hasn’t hit the milestone of registering with the SEC, you should be prepared to be compliant when you take that step.

Read Post

Blameless

Read more about The New SEC Rules and You

Zenduty - Kubecon

Nov 7, 2023 By Zenduty In Zenduty

Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

View Video

Zenduty

Read more about Zenduty - Kubecon

xMatters Support - On-Call Groups

Nov 7, 2023 By xMatters In xMatters

On-call groups enable you to notify a set of users, devices, and other groups as a single recipient. An on-call group may be a simple collection of members or a complex set of shift schedules, escalation timelines, and rotations that lets you notify the right people at the right time.

View Video

xMatters

Read more about xMatters Support - On-Call Groups

The Unplanned Show, Episode 19: Cloud Security response with Ashley Ward

Nov 7, 2023 By PagerDuty In PagerDuty

As organizations move to the cloud, where is there overlap between security and IT and engineering? In this session, Dormain will sit down with Orca Security's Principal Technical Evangelist, Ashley Ward, to learn about how working practices have to evolve with the speed of change in the cloud.

View Video

PagerDuty

Read more about The Unplanned Show, Episode 19: Cloud Security response with Ashley Ward

What you need to know about the Digital Operational Resilience Act (DORA)

Nov 7, 2023 By Sirine Karray In iLert

The European Commission has introduced the Digital Operational Resilience Act (DORA) to bolster the digital infrastructure of the financial sector within the European Union (EU). As part of the EU's wider digital finance strategy, DORA's objective is to create a comprehensive framework governing digital operational resilience. Financial institutions must ensure full compliance with DORA by January 2025.

Read Post

iLert

Read more about What you need to know about the Digital Operational Resilience Act (DORA)

Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

Nov 7, 2023 By Anjali Udasi In Zenduty

Site Reliability Engineers (SREs) play a vital role in ensuring the stability and performance of web services and are key in incident management. One of the core skills SREs need is the ability to conduct effective Root Cause Analysis (RCA) when issues arise. This guide is about how to improve your RCA skills for more effective post-incident analysis. Let's dive in.

Read Post

Zenduty

Read more about Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

What is IT incident management - and how can AIOps optimize it?

Nov 6, 2023 By BigPanda In BigPanda

Imagine you’re in the middle of a critical project, and suddenly, your system crashes. Or perhaps it’s the middle of the night, and your server goes down, affecting countless users. Some IT incidents are inevitable, but the way you manage them makes all the difference in minimizing their impact. You know that proper incident management is critical – and that incidents can become costly.

Read Post

BigPanda

Read more about What is IT incident management - and how can AIOps optimize it?

How we manage incidents at Datadog

Nov 6, 2023 By Laura de Vesine In Datadog

Incidents put systems and organizations to the test. They pose particular challenges at scale: in complex distributed environments overseen by many different teams, managing incidents requires extensive structure and planning. But incidents, by definition, break structures and foil plans. As a result, they demand carefully orchestrated yet highly flexible forms of response. This post will provide a look into how we manage incidents at Datadog. We’ll cover our entire process.

Read Post

Datadog

Read more about How we manage incidents at Datadog

The Journey Into Automation: Optimizing Care Delivery

Nov 6, 2023 By Ritika Bramhe In OnPage

In a world where efficiency and precision are the cornerstones of progress, automation has become the unsung hero across diverse industries. From manufacturing floors to customer service, its transformative power has reshaped the way we work and deliver services. Today, we embark on a journey to explore the profound influence of automation on healthcare, where each automated process is a progressive step towards optimizing care delivery and reshaping the future of patient-centered care delivery.

Read Post

OnPage

Read more about The Journey Into Automation: Optimizing Care Delivery

xMatters Support - Broadcast Groups

Nov 6, 2023 By xMatters In xMatters

In xMatters, groups determine how and when people are notified using on-call schedules, escalation timelines, and rotations. But what if you don't use complex on-call schedules, or need to notify all members of the group simultaneously? Broadcast groups make it easier for customers who don't always need on-call schedules. Let’s take a look.

View Video

xMatters

Read more about xMatters Support - Broadcast Groups

ilert ChatOps: Create alerts in Microsoft Teams

Nov 6, 2023 By iLert In iLert

ilert integration for Microsoft Teams lets you create alerts directly within Microsoft Teams and streamlines your incident management process, making it even easier for your team and stakeholders to report incidents.

View Video

iLert

Read more about ilert ChatOps: Create alerts in Microsoft Teams

Suppressing Alert Noise during Scheduled Maintenance

Nov 3, 2023 By Chitra Bisht In Squadcast

Alert noise is a common problem for IT teams that monitor and manage complex systems. Excessive unactionable alerts triggered by various sources, such as applications, servers, and network devices, can cause alert fatigue. A high volume of alerts can be overwhelming, reducing the ability to respond to critical alerts. One likely source of alert noise is scheduled maintenance, which is a common practice in the digital realm.

Read Post

Squadcast

Read more about Suppressing Alert Noise during Scheduled Maintenance

6 Best Practices for Tuning Network Monitoring Alerts

Nov 3, 2023 By Nolan Greene In iLert

Network monitoring and alerting provide the foundation for efficient IT operations and cyber resilience. By keeping track of the status and performance of network infrastructure and applications, network monitoring tools can automatically generate alerts when defined thresholds are exceeded or specific events occur. These network monitoring alerts allow IT teams to detect outages, performance degradation, and potential security incidents so they can respond swiftly to minimize disruption.

Read Post

iLert

Read more about 6 Best Practices for Tuning Network Monitoring Alerts

Taking down (and restoring) the Raygun ingestion API

Nov 2, 2023 By Vishakh Nair In Raygun

In a world where Software as a Service (SaaS) products are integral to daily life, maintaining uninterrupted service for end-users is paramount. However, stuff happens. When it does, our most valuable response (other than restoring service ASAP) is to review the series of events that led up to the incident and learn from them. On August 25th, 2023, at 7:02 AM NZT, Raygun experienced a significant incident that impacted our API ingestion cluster, leading to an outage lasting approximately 1 hour and 15 minutes. While this wasn't fun for anyone involved, this incident did prove to be a valuable learning experience, shedding light on the importance of infrastructure management and resilience.

Read Post

Raygun

Read more about Taking down (and restoring) the Raygun ingestion API

Status Pages That Deliver: Top 10 Favorites

Nov 2, 2023 By Chitra Bisht In Squadcast

Status Pages represent an invaluable asset for websites and SaaS businesses, particularly in today's environment with prevalent outages and heightened user expectations for seamless uptime. Integral to any robust website monitoring strategy, these pages serve as centralized hubs, offering users a singular, authoritative source for tracking the status of websites and applications.

Read Post

Squadcast

Read more about Status Pages That Deliver: Top 10 Favorites

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

Nov 2, 2023 By Ashley Sawatsky In Rootly

This blog post is adapted from my talk at SRECon EMEA 2023 - original slides are available here! Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating status pages can easily become a tedious and often neglected task during incidents. In this post, we’ll cover some tips to get your status page right.

Read Post

Rootly

Read more about Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

PagerDuty and Jeli Together Will Transform Incident Management

Nov 2, 2023 By Dan McCall In PagerDuty

Today is an important day for us at PagerDuty, and for the larger ecosystem of incident management. We’ve signed a definitive agreement to acquire Jeli, a standout player in the incident management space. This deal represents a strategic alignment of visions, technologies and goals that will have a lasting impact on the industry and our customers.

Read Post

PagerDuty

Read more about PagerDuty and Jeli Together Will Transform Incident Management

Stop aiming for a 'perfect' monitoring and observability strategy - and start using AIOps

Nov 2, 2023 By Amy Brennen In BigPanda

Change is the only constant in today’s continuously shifting IT landscape. Whether you’re adding new observability tools, retiring existing monitoring systems, establishing new business units, or onboarding IT systems from acquisitions, managing these non-stop changes can challenge even your expert ITOps team. Trying to get your monitoring house in order is a daunting task.

Read Post

BigPanda

Read more about Stop aiming for a 'perfect' monitoring and observability strategy - and start using AIOps

Basics of Incident Management

Nov 1, 2023 By Kaushik Thirthappa In Spike

Life is full of unexpected incidents. From the coffee spill that disrupts your morning routine to the sudden traffic jam that transforms a 20-minute commute into an hour-long ordeal. Much like these challenges, most of our systems and infrastructure also constantly face these tiny glitches. If ignored, they can have a significant impact. Unlike minor inconveniences, these glitches we call Incidents have the potential to disrupt your business, frustrate customers, and eat into your revenue.

Read Post

Spike

Read more about Basics of Incident Management

Set Responders Up for Success with New User Onboarding

Nov 1, 2023 By Cristina Dias In PagerDuty

Effective incident response plays a critical role in maintaining smooth operations at organizations of all sizes. When built up correctly, operational resilience–that ability to bounce back quickly after failure–can act as a shield that guards your customer experience, ensuring that even when incidents inevitably happen, you’re back online in no time.

Read Post

PagerDuty

Read more about Set Responders Up for Success with New User Onboarding
