February 2023

Sponsored Post

Reducing Security Incidents: Implementing Docker Image Security Scanner

Are you using Docker to deploy your applications? If so, you're not alone. Docker's popularity has skyrocketed in recent years. While it offers numerous benefits, it also introduces new security risks that need to be addressed. But why is reducing security incidents so important? Simple: the cost of a security breach can be devastating. From lost customer trust to financial losses, the consequences of a security incident can be severe. That's why it's crucial to take steps to prevent them from occurring in the first place. Enter Docker image security scanners.

S1E1: Maximize service uptime with efficient incident management workflows - Masterclass 2023

In this episode of Masterclass 2023, we'll cover how IT service management teams can utilize ServiceDesk Plus to quickly handle incidents and streamline the incident resolution process in a hybrid work setting. You'll learn how ServiceDesk Plus can enhance the effectiveness of incident management practice through collaboration, dynamic template creation, automation, and more.

RedSky & Mass Notification: Enhance Your E911 Response

It’s your duty to empower on-site responders with the tools they need to gain situational awareness, speed, and coordination to potentially save lives during an incident. RedSky E911 now includes Everbridge Mass Notification with Incident Communications to provide the ultimate emergency notification and alerting capabilities to help protect your employees that are using your Multi Line Telephone System (MLTS), as required by Federal law.

Incident Workflows

Time is of the essence when responding to incidents: within seconds, all the right responders need to be mobilized and the right stakeholders informed. PagerDuty Incident Workflows empowers teams with sophisticated automation capabilities to reduce the manual work required to escalate and mobilize team members. Using if-this-then-that logic on our no-code/low-code builder, you can orchestrate and automatically trigger the right set of incident actions for your needs at any time.

Taking the fear out of migrations

Over the last 18 months at incident.io, we’ve done a lot of migrations. Often, a new feature requires a change to our existing data model. For us to be successful, it’s important that we can seamlessly transition from the old world to the new as quickly as we can. There are few things in software where I’d advocate a ‘one true way,’ but the closest I come is probably migrations. There’s a playbook that we follow to give us the best odds of a smooth switchover.

S1E1: Maximize service uptime with efficient incident management workflows [Cloud]

In this episode of Masterclass 2023, we'll cover how IT service management teams can utilize ServiceDesk Plus Cloud to quickly handle incidents and streamline the incident resolution process in a hybrid work setting. You'll learn how ServiceDesk Plus Cloud can enhance the effectiveness of incident management practice through collaboration, dynamic template creation, automation, and more.

[PODCAST] Season 2 - Episode 1 The ITOps 2023 predictions; what does the future hold.

What will 2023 hold for ITOps? As we look back on 2022, with its stellar growth for many companies and positive hiring trends, we hope that 2023 is even more successful for those involved in ITOps. In this episode, we take a deep dive into #predictions for 2023 and the future of #ITOps. #aiops #ITOps #podcast

[PODCAST] Season 2 - Episode 3 - Resolving unforeseen ITOps events

Even the best teams can encounter outages. Sometimes there are environmental anomalies in the data center or a component failure that leads to unplanned downtime. In this episode, we explore how IT teams can limit the impact of outages on business operations and resolve them when they arise. #itops #aiops #podcast

Game Day: Stress-testing our response systems and processes

At incident.io, we deal with small incidents all the time—we auto-create them from PagerDuty on every new error, so we get several of these a day. As a team, we’ve mastered tackling these small incidents since we practice responding to them so often. However, like most companies, we’re less familiar with larger and more severe incidents—like the kind that affect our whole product, or a part of our infrastructure such as our database, or event handling.

Webinar on 'Evolution of Incident Management from On-Call to SRE' | Squadcast

Incident management has evolved considerably over the last decade, more so in the last few years. What was traditionally limited to having just an in-house on-call team and an alerting system has now grown well beyond that to ensure Automation, Collaboration, Transparency, and Retrospection are deeply entrenched in Incident Response.
Sponsored Post

Areas to Streamline Incident Management

When a serious incident occurs, time is essential. Streamlining different components of the incident response and management process can help minimize the time it takes to resolve an incident. Proper streamlining also helps reduce downtime, restore functionality, and potentially curtail the overall impact of an incident, not to mention the costs incurred during these events. This article examines several areas of incident management, the potential challenges of manual implementation, and how an automation platform can alleviate these challenges to provide a streamlined incident response process.

6 Must-Have Features of an Alert Notification Software

Alert notification software is an essential tool for IT operations, as it enables teams to quickly respond to critical issues and ensure the smooth running of systems and services. With the increasing complexity of IT environments, it is more important than ever to have a robust alerting system in place. General robustness is essential, as such an alert notification system will quickly become an essential part of your operations stack.

Incident Management KPIs - what really matters

In the age of Big Data and analytics, companies are increasingly using the power of numbers and data to improve their processes. In the incident management world, this means turning to KPIs, metrics, and other incident monitoring methods to recognize trends and take corrective action. To manage and improve your incident management processes, you have to keep an eye on KPIs and metrics.

How to choose the right Incident Management software?

Software programs known as incident management solutions assist organizations in managing occurrences, tracking and monitoring incident response activity, and evaluating the performance of their incident response teams. They are crucial to any organization’s incident response strategy and can aid teams in coordinating their efforts, getting in touch with key stakeholders, and preserving their work.

"Avoiding Catastrophic Outages" | DeveloperWeek 2023

In this talk, Andrew Zigler (Developer Advocate at Mattermost) discusses root causes of catastrophic outages and approaches to prevention using open source technologies you can deploy in less than a day. He'll talk through real-life case studies from manufacturing plants to global media companies to the world's largest banks and other mission-critical technical teams.

How to untangle monitoring noise and leverage observability best practices

Most organizations suffer from some form of alert noise, shares Adam Blau, senior director of product marketing at BigPanda. “Alert noise is only going to increase as organizations support cloud-native applications spanning multiple public and private clouds, including ephemeral deployments and more. It’s not going to get easier for organizations to understand the signal from all those alerts being sent,” Blau said.

Reduce IT costs without increasing incidents and escalations

As technology in business continues to evolve, IT costs can quickly add up. Companies may be looking for ways to reduce IT costs while maintaining a high customer service level. This article will discuss the potential benefits of lowering IT costs without increasing incidents and escalations. We will explore strategies to reduce IT costs, improve customer service, and increase employee productivity.

IT (Information Technology) Alerting Software

IT support engineers rely on many specialized monitoring tools to detect infrastructure, application, and security problems. Once a monitoring tool detects a problem, its alerts must notify support to start incident response. Many complexities arise after the alert is sent. AlertOps offers many alert management features.

Private Status Pages Are The Key To Effective Incident Management

The IT team for a large organization plays a crucial role in ensuring the smooth operation of the company’s technology infrastructure. One important aspect of their job is incident management, which involves identifying, assessing, and resolving issues that arise with the technology systems. IT teams utilize status pages to interface with end-users in order to inform them of system status, downtime and maintenance.

Zenduty - Tutorial 15 - Zenduty API and Postman Collections

Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle. With the Zenduty API, you can supplement and deploy Zenduty in sync with other tools and services, allowing you to create and update incidents, users, teams, services, integrations, schedules etc. and automate your workflows using simple scripts.

Deduplication Rules | Reduce Alert Noise by Clustering Similar Alerts I Squadcast

Alert Deduplication can help you reduce alert noise by organising and grouping alerts. It also provides easy access to similar alerts when needed. This video on Alert Deduplication rules will help you define Deduplication Rules for each Service in Squadcast. Alerts will get deduplicated when these rules evaluate true for an incoming incident.
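
As a rough illustration of the idea (not Squadcast's actual rule syntax or API), a deduplication rule can be thought of as a predicate that compares an incoming alert with an already open incident. The field names `service` and `message`, and the helpers `dedup_rule` and `ingest`, are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    message: str
    # Alerts that were folded into this incident instead of opening a new one.
    duplicates: list = field(default_factory=list)


def dedup_rule(incoming: dict, existing: Incident) -> bool:
    """Hypothetical rule: same service and same message counts as a duplicate."""
    return (incoming["service"] == existing.service
            and incoming["message"] == existing.message)


def ingest(alert: dict, open_incidents: list) -> Incident:
    """Attach the alert to the first matching incident, or open a new one."""
    for incident in open_incidents:
        if dedup_rule(alert, incident):
            incident.duplicates.append(alert)
            return incident
    incident = Incident(alert["service"], alert["message"])
    open_incidents.append(incident)
    return incident
```

Keeping the rule as a separate function means each service could, in principle, supply its own predicate without changing the ingestion loop.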
Sponsored Post

Incident Management: Tips for Tech Companies

A seemingly straightforward technical problem can often have explosive consequences. Say a tech team restarts a cloud server overnight; those few minutes of downtime might trigger a problem elsewhere and cause your app to crash. The following morning, customers can't access your services, you're trending on social media for all the wrong reasons and your customer service reps are left to pick up the pieces. Scenarios like this prove the value of incident management. But you need best practices that ensure incident management does what it's supposed to do. Otherwise, it's just another buzzword. Here are some best practices for incident management that you need to incorporate into your tech organization.

5 tips for a successful on-call duty

On-call availability is crucial for many industries, especially in IT. With the growing reliance on IT systems and services, their availability directly impacts the success and satisfaction of customers. To ensure round-the-clock availability, on-call services are vital for prompt responses to emergencies and issues.

Four ways tech will evolve in 2023

Will artificial intelligence (AI) end up emphasizing the importance of human emotions? What’s next for company operating budgets? And is a reckoning coming for managed service providers (MSPs)? In a recent episode of our That’s great IT podcast, we invited an expert panel to discuss all of this and more. The panel consisted of three returning guests, who shared the top IT trends they’ve seen in their industries and how they expect those trends to play out in 2023.

The Fundamentals of Enterprise Incident Management

In the world of enterprise major incident management, integrating partial or full automation across each stage of the incident response and management lifecycle makes a big difference to the speed at which incidents are addressed and the data you have to understand them afterward. Gartner coined the term “Incident Response Automation” in its 2020 report Automate Incident Response to Enhance Incident Management.

Why Clearco switched to Grafana Alerting, Grafana OnCall, and Grafana Incident

Working with technology means dealing with incidents or outages from time to time, so staying on top of problems is essential. Back in the spring of 2022, Clearco, the world’s largest e-commerce investor, had an alerting system set up to catch issues, but there was one problem: Clearco’s Customer Success team would learn of a problem before a notification even went off.

Preventing Outages in 2023

The outages span the giants of the Internet and some of the biggest failures of IT resilience we were subject to – from AWS’s trifecta of outages in December 2021 to the October ‘21 outage that took down Facebook, Instagram, WhatsApp, and interrelated services. We also look at some more intermittent outages that you may have missed.

PagerDuty Mobile: Stay ahead of incidents, anywhere, anytime

Experience an all-in-one app for viewing, managing, and responding to critical incidents with PagerDuty Mobile. It gives you immediate access to incident details, service information, and recent change events. You can easily set up Slack channels and video conferences for streamlined incident response through incident workflows. So you can deliver faster time to resolution and focus more time on what matters the most.

Integrating Slack & Squadcast- Trigger, Acknowledge, Resolve & Reassign incidents from Slack channel

You can integrate Squadcast and Slack to collaborate efficiently with your team while working on incidents. Squadcast sends a notification to the configured Slack Channel as soon as an incident is triggered.

Making transparency a principle of your company's culture

You’ve probably heard the phrase “transparency is key” more than you can bear at this point—so let’s get this out of the way. Transparency is key. The phrase suddenly became that much more unbearable. But before you drop off, let me also communicate something else: transparency is often not enough. Often, companies make the mistake of leaning on transparency as a catchall solution to many of their internal comms issues.

Top 3 ways to successfully create and defend your IT budget

Budgets are a touchy subject for anyone, and there’s no one-size-fits-all approach. However, the work ITOps does is integral to the success of your organization, so being confident in building and defending your budget is crucial to getting the resources you need. So what does success look like when it comes to ITOps budgets? In our recent podcast episode from our series, That’s great IT, I sat down with global IT leader Nigel Peacock to discuss the best ways to justify your ITOps budget.

How to Use Big Data to Your Advantage

Users have been generating increasing amounts of data in the past few years, partly due to rapid digitalization since the pandemic. As a result, increasing numbers of analytics applications are capitalizing on these data assets. However, building scalable systems is no trivial task and incidents are inevitable. Complex systems generate data in the form of logs, traces, metrics, and more, which organizations often find themselves sprinting through. Such logs are a powerhouse of valuable information.

Creating Tickets in Jira From Squadcast I Jira Integration (Cloud & Server) I Squadcast

This video will help you install and configure the Squadcast extension for Jira Cloud & Jira Server. It will help you create tickets in Jira projects whenever there is an incident in Squadcast. Also, learn to automatically or manually sync the status bidirectionally.

Integrating Microsoft Teams & Squadcast - Acknowledge, Resolve & Reassign Incidents | Squadcast

Teams using MS Teams can now integrate with Squadcast and easily Acknowledge, Resolve & Reassign incidents using MS Teams. You can configure Squadcast to send a notification to the configured MS Teams channel as soon as an incident is triggered.

Types of Incident Retrospective Templates

When an incident occurs, it's important to take the time to review what happened, understand all the contributing factors, and identify systemic changes to prevent similar incidents from happening in the future. This process is known as an incident retrospective. However, conducting incident retrospectives can be time-consuming and difficult, especially when dealing with multiple stakeholders and a large amount of data.

Take the "work" out of your incident workflow: Integrating Blameless with Opsgenie

Assemble the right team for incident management fast with the new bidirectional integration of Blameless and OpsGenie. In this 30-minute live webinar, Blameless's Aaron Lober, Paul Chu, and Nicolas Philip show you how to seamlessly connect your alerting and service registry to your incident response processes. Webinar includes a live demo.

Webinar: The 2023 ITOps forecast

Tech saw a lot of challenges in 2022. ITOps, NOC, and SRE teams grappled with shifts in staffing, a disappearance of those with tribal knowledge, a continuing transformation of consumer spending habits, and a general disruption of workplace culture. So what will 2023 look like for the industry? Likely, more volatility—but our panel of industry experts are here to help you navigate the choppy waters while also making some bold predictions. Change is the only constant in the tech sector.

CommsFlow Messaging Templates | Blameless

Effective communication is critical during incidents. In order to minimize the impact of an incident and resolve it quickly, it's important that all stakeholders are kept informed and updated throughout the incident response process. However, communicating during an incident can be challenging, especially when dealing with multiple stakeholders and a high level of stress. On-call engineers can have their focus disrupted by switching out of their diagnostic tools to issue communications.

Why AIOps is Worth the Investment During an Economic Downturn

Recent talks of an economic softening have left IT leaders concerned about the future of their enterprises. That concern is understandable — tech layoffs create near-daily headlines at this point, with top companies rolling back their operations and rolling up their sleeves to focus on mission-critical expenses. And for many in ITOps, that means cutting tools.

Incident Workflows with Sam Ferguson

PagerDuty’s new Incident Workflows feature will help your teams build powerful, flexible incident response processes customized to your organization’s needs. Add Slack channels, Zoom calls, responding teams, and more. PagerDuty Senior Product Manager Sam Ferguson walks us through how this new feature set works and demonstrates some of the capabilities.

ServiceNow Integration - xMatters Integrations

Looking to extend the value of your existing applications? The xMatters and ServiceNow integration allows organizations to accelerate IT incident response, reduce downtime, and maximize service reliability. Learn some of the most popular ways you can utilize these two industry-leading platforms, including engaging resources and automated technical escalations!

Reporting Incident Using Webforms I Creating Alerts from Outside the Squadcast Ecosystem I Squadcast

Webforms can help stakeholders & the customers of an organization easily report issues. This video explains how users from outside the Squadcast ecosystem can report incidents by filling out a simple form and extend customer support by empowering internal stakeholders and customers to report issues on the go.

How To Set Up Outgoing Webhooks in Squadcast | Receiving Incident Information | Squadcast

Webhooks allow you to connect a platform you manage (either an API you create by yourself or a third-party service) to a stream of future events. Setting up a Webhook on Squadcast enables you to receive information (referred to as events) from Squadcast as they happen. This can help you avoid continuously polling Squadcast’s REST APIs or manually checking the Squadcast web/mobile application for desired information.
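
To make the push-versus-poll contrast concrete, here is a minimal sketch of a webhook receiver built only on Python's standard library. It assumes the sender POSTs a JSON body; the port number and the `handle_event` helper are arbitrary choices for illustration, and nothing here reflects Squadcast's actual payload schema or authentication requirements.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store for this sketch; a real receiver would queue or persist events.
received_events = []


def handle_event(raw_body: bytes) -> dict:
    """Parse a webhook body (assumed to be JSON) and record it."""
    event = json.loads(raw_body)
    received_events.append(event)
    return event


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            handle_event(self.rfile.read(length))
            self.send_response(204)  # accepted, no response body needed
        except ValueError:
            self.send_response(400)  # body was not valid JSON
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block and wait for events to be pushed to this process."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; from then on the process is idle until an event arrives, which is the work that repeated polling would otherwise have to do.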

How to Set up SLOs and Configure SLIs in Squadcast | Tracking Error Budget & Burn Rates | Squadcast

This video will help you define and monitor Service Level Objectives for your services and also set up and track error budget burn rates in Squadcast. A Service Level Objective (SLO) is a reliability target, measured by a Service Level Indicator (SLI), and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
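
The relationship between an SLO, its error budget, and a burn rate can be sketched with a little arithmetic. This uses the commonly cited definitions (error budget = 1 minus the SLO target; burn rate = observed error rate divided by the error budget) and is not taken from Squadcast's implementation. The function names are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0  # no traffic, so nothing was burned
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)
```

For example, with a 99.9% SLO and 50 failures out of 10,000 requests, the observed error rate is 0.5% against a 0.1% budget, so the burn rate is about 5: the budget would be exhausted in roughly a fifth of the SLO window if that rate continued.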

Quick! Grab all the evidence: Capturing application state for post-incident forensics.

Everyone loves a good mystery thriller. Ok, not everyone – but Hollywood certainly does. Whether it’s Sherlock Holmes or Hercule Poirot, audiences clearly enjoy a page-turning plot of hunting down the culprit for some heinous crime.

5 Best practices for developing a culture of continuous improvement

How do you create a great engineering team? Exclusively hire brilliant, tenured computer science PhDs. There, we solved it. You can skip the next 400 words. (I can hear my college professor in my head saying “Humor might not be your strong suit.”) Building a great engineering team isn’t easy. Understatement of the year. It’s not even a problem to be solved per se. We need to think about it as preparation to solve an infinite set of constantly evolving problems.

Knightscope Relies on PagerDuty to Keep Their Robots Rolling

As security becomes more advanced and available, companies must look for ways to be more efficient with their resources in order to stay competitive. With challenges that limit the capabilities of companies, such as limited employee resources and low customer tolerance for delays in services, reliable and affordable solutions are necessary. In this case, it means disrupting the traditional security industry. Organizations are achieving their goals by relying on automation and technology.

Best Practices for Managing Incidents at Varying Severity Levels

A software incident is an event or unplanned interruption that causes the software to deviate from its intended behavior, affecting the quality of service. With the ever-changing nature of the software industry, incidents are inevitable, particularly in teams that practice iterative software development cycles with constant releases to production. This necessitates a robust incident management strategy.

3 examples of DevOps automation

Automating processes and the tools that enable them is vital for empowering highly productive teams. The right automation tools and workflows help DevOps and SRE teams minimize repetitive tasks, improve monitoring capabilities, enable continuous integration/continuous deployment (CI/CD), and work with massive volumes of data.

Suppression Rules in Squadcast | Minimise Alert fatigue | Suppress Non-Actionable Alerts | Squadcast

This video talks about Alert suppression in Squadcast. Alert Suppression helps you avoid alert fatigue by suppressing notifications for non-actionable alerts. Squadcast will suppress the incidents that match any of the Suppression Rules you create for your Services. These incidents will go into the Suppressed state and you will not get any notifications for them.

Your non-technical teams should be using incident management tools, too

For many businesses across the world, incident management is something that’s usually left to engineers. These teams are on the front lines, declaring, managing, and resolving all sorts of incidents across the org, regardless of where it originates or what form it takes. But there’s a glaring issue with this approach. Outside of technical teams, people across organizations aren’t accustomed or trained to use the word “incident” whenever an issue comes up.

Announcing our improved Schedules & On-Call Rotations

Hey folks! We are super excited to announce that our schedules feature has gone through a bit of an update. Well, more than a bit 🙂. We’ve gone through the feature with a fine-toothed comb and introduced a bunch of UI and functional improvements which we hope will help you achieve one thing: set up, edit and manage your on-call schedules at scale in a matter of minutes (Yes, that was three things but it was tough to condense it to ONE thing)

[SRE: From Theory to Practice] What's difficult about problem detection?

In this episode of FTTP, Kurt Andersen and Matt Davis are joined by Joanna Mazgaj and Laura Nolan to talk about the implications of and considerations for problem detection. Watch the full episode and hear them share personal stories about the types of challenges you might face. Ultimately, how do we explain and address the socio-technical concepts behind problem detection?

[SRE: From Theory to Practice] What's difficult about incident command?

Welcome back to our mini series of fireside chats with SRE experts talking about the realities of their day-to-day. Episode 2 gets intimate — What’s difficult about incident command? We invited Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, to chat with Jake Englund and Matt Davis from the Blameless team. Watch the full conversation where they cover everything from methodologies and technical expertise to the human and social aspects of reliability engineering.

Using Tagging and Routing Rules in Squadcast I Incident Classification I Event Tagging I Squadcast

Event Tagging is a rule-based, auto-tagging system with which you can define customized tags based on incident payloads, that get automatically assigned to incidents when they are triggered. This video explains how to create Tagging rules for efficient Incident Classification.
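
A rule-based auto-tagger of this kind can be sketched as a list of (field, pattern, tag) rules evaluated against each incoming payload. The rule format, field names, and tag values below are hypothetical and only illustrate the mechanism; they are not Squadcast's rule syntax.

```python
import re

# Hypothetical rules: (payload field, regex to search for, tag key, tag value)
TAGGING_RULES = [
    ("severity", r"^(critical|high)$", "priority", "P1"),
    ("message", r"(?i)database|postgres", "component", "database"),
]


def apply_tags(payload: dict, rules=TAGGING_RULES) -> dict:
    """Return the tags whose rules match the given incident payload."""
    tags = {}
    for field_name, pattern, key, value in rules:
        # Missing fields are treated as empty strings so they never match.
        field_value = str(payload.get(field_name, ""))
        if re.search(pattern, field_value):
            tags[key] = value
    return tags
```

Under these assumed rules, a payload such as `{"severity": "critical", "message": "Postgres replica lag"}` would receive both the `priority` and `component` tags, which routing logic could then act on.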

Maximizing IT Company Success through Effective On-Call Support

Having your systems monitored by a reliable solution is important, but how do you ensure that the right people are informed about issues that arise? Identifying problems is the first step, but they also need to be routed to the appropriate individuals. Keep in mind that employees may not always be sitting in front of the dashboard. On-call support means being available outside of normal working hours to quickly respond to emergencies and problems, including not only weeknights but also weekends and holidays.

Adding Incident Watchers in Squadcast | Incident Notifications and Updates | Squadcast

This video talks about Squadcast's Incident Watchers Feature. In Squadcast, any user/stakeholder can subscribe to an Incident and act as a Watcher for an incident. Incident Watchers can choose to receive notifications for all the updates of an incident. This allows any user/stakeholder to act as an observer of the incident, even if they are not active responders. You can customize your watch options for the incident and receive notifications only for those updates.

Common Incident Terminology

Operations, customer support, engineers and most groups use inconsistent language. This is a serious problem. Imagine NASA doing that with astronauts or a navy with ships talking to each other, but not using the same terms. Something very bad will happen. In our space of incident management, we use words like broke, failed, outage, doesn’t work, dead…all describing the same condition.

Make your ITSM more efficient with PagerDuty and ServiceNow

Putting PagerDuty between your monitoring systems, CI/CD systems—really, anything emitting events about your digital environment— and your ServiceNow CMDB opens the door for better event management and correlation, incident response automation, advanced analytics and more, helping you service distributed and central teams together for faster turnaround and better customer experience.

How to consolidate your incident response stack using PagerDuty

PagerDuty is a comprehensive incident response solution that unifies disparate tools into a single platform. This helps teams respond to incidents faster and more effectively while reducing operational costs. PagerDuty also supports a shift from manual, reactive incident management to an automated, proactive approach, making the incident response process more efficient and resilient.

Here's what to focus on when reviewing an incident

Incidents can be a bit noisy. Especially when it’s one of higher severity, there are a lot of moving parts that can make it difficult to come away with the information you want at a glance. And if you’re someone who isn’t necessarily tapped into the day-to-day of incident response, such as a head of a department or executive, you’ll want to be able to glean the most actionable information in just a few seconds without having to dig through dense documents.

Top 5 Tools for SRE 2023 (Updated)

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover the top 5 tools for SRE that can be used to drive the reliability and stability of software systems. We’ll also examine how SREs can use the tools to improve operations tasks and infrastructure processes.

Enterprise Alert 9.4.1 comes with fixes and the revised version of the sentinel connector app

In this release, we have addressed a number of bugs that were impacting the performance and functionality of the system. In the Kernel, we have resolved an issue where the broadcast was not being stopped after the first user acknowledged it. Additionally, we have fixed a crash that was occurring when loading component infos and an error log that was being generated when the Kernel started in suspended mode.

Announcing: Blameless + OpsGenie Integration

In the opening moments of an engineering incident, the most important aspect of a response plan is speed. Getting out of the gate quickly by leveraging automation to assemble the team can save precious moments during a critical engineering incident and make the difference between happy and unhappy customers downstream. This is why we’re excited to announce the integration of Blameless with OpsGenie.

Extend the Power of Your ServiceNow Application with PagerDuty for Customer Service

The last few years have led to an increasingly digital world. We are all online, streaming, shopping, or simply surfing. In this new world, customer experience is more critical than ever. Customers want things to work as seamlessly as possible, and when things go wrong, so goes their trust and business. The key priority for many businesses is keeping those systems running as smoothly as possible to keep customers happy and build their loyalty.