
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

New BigPanda features accelerate IT incident response

ITOps teams are inundated with a significant volume of alerts each day. Sifting through these alerts to discern which ones are harmless and which could lead to major incidents is a time-consuming and tedious task. This process often involves hunting for information across disparate data sources, tools, and workflows. As a result, the investigation can slow down incident response times, negatively affecting service reliability and customer satisfaction.

3 Ways to Streamline Kubernetes Operations with PagerDuty Automation

Kubernetes' popularity continues to grow, with over 60% of organizations maintaining multiple Kubernetes clusters across diverse environments and teams in some capacity. However, as clusters multiply, so do operational challenges, from monitoring hundreds of microservices to responding to and escalating incidents across distributed systems.

Building an AI Chatbot Playground with React and Vite

Read how we set up an experimental chatbot environment that allows us to switch LLMs dynamically and enhances the predictability of AI-assisted features' behavior within the ilert platform. The article includes a guide on how you can build something similar if you plan to add AI features with a chatbot interface to your product.

A Beginner's Guide To Service Discovery in Prometheus

Service discovery (SD) is a mechanism by which the Prometheus monitoring tool can discover monitorable targets automatically. Instead of listing every target to be scraped in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime. Service discovery becomes crucial when hosts change dynamically, especially in microservices architectures and environments like Kubernetes.
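One simple way to see the idea is Prometheus's file-based service discovery, where a scrape job's `file_sd_configs` points at JSON files that list target groups, each with a `targets` array and an optional `labels` map. The following is a minimal sketch, not the linked guide's method: it assumes some external inventory produces records with `address` and `labels` fields, and the helper names `build_target_groups` and `write_file_sd` are hypothetical.

```python
import json
import os
import tempfile


def build_target_groups(instances):
    """Group host:port addresses by label set, in the file_sd JSON shape.

    `instances` is an assumed inventory format: a list of dicts with an
    "address" string and a "labels" dict of string keys and values.
    """
    groups = {}
    for inst in instances:
        key = tuple(sorted(inst["labels"].items()))
        groups.setdefault(key, []).append(inst["address"])
    # Each group becomes {"targets": [...], "labels": {...}}.
    return [
        {"targets": sorted(addresses), "labels": dict(key)}
        for key, addresses in sorted(groups.items())
    ]


def write_file_sd(path, target_groups):
    """Write the target file via a temp file and rename.

    The rename replaces the old file in one step, so a reader should not
    observe a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(target_groups, handle, indent=2)
    os.replace(tmp_path, path)
```

A script like this could run on a schedule and rewrite a file such as `targets.json` whenever the inventory changes. Prometheus would then pick up the new targets from that file without any edit to its main configuration, which is the core benefit the paragraph above describes.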

Top 5 outages detected by StatusGator in October 2024

StatusGator’s Early Warning Signals alerted customers to several notable service outages in October 2024. With advance warning, our users could take proactive measures, minimizing the impact of downtime on their businesses. Here’s a summary of how our detection gave customers an edge over service disruptions, often notifying them minutes or hours before the provider even acknowledged the issue.

Incident Response Automation: How It Works & Why It Speeds Up Resolutions

The speed at which you respond to incidents can make or break user satisfaction, team morale, and business continuity. Whether it’s a server crash, a security breach, or a software bug affecting users, rapid and efficient incident management is key to maintaining a strong reputation and minimizing operational downtime. And while traditional manual responses have worked in the past, automated incident response is now paving the way for faster, smarter, and more efficient handling of these issues.

Demo Roundups! Automation Standardization (Workflows)

Join PagerDuty’s Solutions Consultants Bobby Zimmerman and Justyn Roberts to discover how combining technical automation with human-driven processes can reduce manual interventions, streamline repetitive tasks, and increase operational efficiency. Level up your digital operations expertise with PagerDuty Demo Roundups, a series of live, interactive webinars where you can deepen your knowledge of the Operations Cloud and see how PagerDuty can work for you. Each 1-hour session presents a hands-on demo that showcases PagerDuty’s capabilities in real time, followed by Q&A.

Site Reliability Engineer's Guide to Black Friday

It’s gotten to the point where Black Friday reliability prep has to start on… well, Black Friday. This year, 32% of US consumers said they planned to start their holiday shopping between July and October. Plus, Black Friday isn’t the only day eCommerce businesses have to worry about: now we have Cyber Monday, Travel Tuesday, and the thousands of Prime Days from Amazon.