September 2022

Putting HC3's Cyber Posture Recommendations into Practice

Of growing concern to both patients and the professionals who facilitate their care is the trend of healthcare organizations being preyed upon by cybercriminals. In the United States, recent political dialogue has brought special attention to patients’ privacy rights under HIPAA and the ongoing security of their records.

PagerDuty Apps for AWS + Automated Diagnostics Demo Highlights (3 min.)

"Reduce downtime and customer impact with service ownership while enabling teams to drive continuous improvement and innovation. Learn about how you can modernize and optimize your operations with our enterprise-grade set of AWS integrations. Automate incident response with PagerDuty’s Runbook Automation and learn about our new set of AWS plugins and prebuilt jobs that make it easier to get up and running with auto-diagnostics."

What's New: Updates to Mobile, PagerDuty Process Automation Software & PagerDuty Runbook Automation, and More!

We’re excited to announce a new set of updates and enhancements to the PagerDuty Operations Cloud. Recent development and app updates from the product team include Incident Response, PagerDuty® Process Automation, as well as Community & Advocacy Events updates. We continue to help customers automate everywhere to optimize cloud operations and reduce the number of issues escalated to other teams.

The Time is Now to Learn from Availability to Optimize Customer Experience

We’ve just launched our inaugural State of Availability Report and the results are sobering. We discovered that: We’d hoped that at this point in the global digital transformation, organizations had gotten further ahead with mastering availability but there’s still a long way to go.

Sponsored Post

How Adaptive Incident Management Gives You the Upper Hand

One of the great things about the TV detective Columbo was that he never made a hasty decision based on first impressions or appearances at a crime scene. It didn't matter how obvious it seemed to be who committed the crime (or how good the frame-up was): Columbo always dug deeper into motives, opportunities, and methods to uncover who the guilty party was.

Product metrics @ incident.io, a year (and a half) in

We’ve been celebrating a few big milestones 🎉 at incident.io in the last few months. We were recently discussing product metrics (as you do for fun on a Friday afternoon 🤓), and Lawrence was very surprised with a particular stat around the number of workflows that have been run using incident.io.

Got an incident? pull the Andon Cord

Andon Cord catapulted Toyota into 40 years of unprecedented quality and domination. What is Andon Cord and how did they do it? In the early 1900s, Taiichi Ohno architected and introduced Andon cord in Toyota's manufacturing plants. The problem: This costs a lot of money. Production costs have always been high. In 1984, it cost NUMMI $15,000 per minute. That's $42,758 in today's value.

BigPanda's new self-service tools are primed to make integration onboarding even faster

BigPanda supports inbound integrations for alert ingestion out of the box; however, many IT organizations have older, rarer or custom-built tools that require a little more work upfront. Fortunately, BigPanda’s recently announced Open Integration Manager and Email Parser aim to streamline integrating these kinds of monitoring tools with the BigPanda platform.

October is National Cybersecurity Awareness Month

It’s National Cybersecurity Awareness Month, and as a Cybersecurity Awareness Month Champion Organization, xMatters is proud to be actively participating. Since the National Cybersecurity Alliance started this initiative in 2004, the number of devices connected to the internet and the amount of time we spend interacting online have increased exponentially. The impact on our lives is so massive that it’s become hard to imagine what life would be like without our devices.

Defining and measuring your SLIs and SLOs

Customers expect online services to be available all the time. The truth is that outages happen to almost everyone because providing 100% service availability is challenging and costly. Creating a reliable and profitable service means, among other things, finding the balance between application availability, costs and time to market. Faster feature delivery means less availability, as constant changes to production may cause issues and introduce bugs.

Create and Manage Maintenance Windows Through PagerDuty Mobile App

In order to respond in real-time to urgent, critical digital incidents, on-call responders must be able to take action from anywhere. But when on-call responders become overwhelmed with alerts, they often just “ignore them” because they cannot tell the difference between a real alert and a false one.

Sponsored Post

What Are Runbooks and How Does It Apply to Network Operation Centers (NOCs)?

Much like in other production environments, the production of cloud services is based on and orchestrated by a plethora of tools that make up part of the overall cloud infrastructure. Given how cloud services are as complex as they are intricate, a vast range of detailed steps needs to be performed in a certain order for the production environment to run smoothly, whether it's carrying out maintenance procedures, updates and upgrades, or resolving issues to prevent downtime.

Featured Post

The Economic Crunch is Here: Time to Get AIOps Right

Economic warning signs are flashing, and organisations of all sizes are balancing the need for fiscal discipline and efficiency while fighting to retain customers, when a single negative interaction can send them running to a competitor. Business digital operations are more complex than ever, and compounding the problem is that companies are still adapting to remote work and pandemic-driven digitisation. Our recent report confirms that delivery teams are facing increased pressures, unreasonable business demands, and higher rates of burnout.

Service Catalog: Simplifying Service Management and Ownership

With the adoption of cloud and microservices, modern IT infrastructures operate with a mesh of services that cater to multiple user requirements. It can get very difficult to simultaneously keep track of numerous services. A Service Catalog helps organize service-related information in a single pane, achieve end-to-end service ownership and get real-time performance insights.

Prepare Don't Panic! How to Build a Resilient Security Posture with Automation

Since the outbreak of the global COVID-19 pandemic, delivering exceptional digital customer experiences has become mission critical for businesses in a broad array of verticals. All too often, however, the race to build, change and deploy features increases the incidence of customer-impacting service disruptions. Today’s DevOps, SRE and operations teams are struggling to keep up. Yet the business’s ability to provide seamless and reliable digital services to customers has never been more important to its success.

Sponsored Post

Exploring PagerDuty Alternatives for Incident Response

Incident response refers to effectively responding to infrastructure issues and resolving them in the shortest time frame possible. Due to several loss-inducing high-profile outages over the last few years, organizations have sought to create rigorous processes with specialized tools to resolve incidents quickly and learn from their failures. As one of the first platforms to enter the incident response space, PagerDuty is a dominant player, but over the years, competing platforms have begun carving out their own niche in the incident response space.

What are the new stages of incident management?

Good communication is at the core of any incident management process, empowering stakeholders with the information they need to avoid lost productivity. Delivering the right message through the right channel to the right people across the enterprise is key – if you’re simply firefighting and communicating reactively, stakeholders will likely get frustrated.

Released: Better Uptime Integration

StatusGator has a wide variety of use cases: from education to help desk to IT and managed services and DevOps, too. All corners of an organization depend on cloud services and StatusGator gives you visibility into the status of all of your vendors. We’ve heard over and over from our DevOps users that alerts and notifications for their teams are already centralized into a single incident management platform such as OpsGenie, PagerDuty, or FireHydrant.

Want to improve your incident response plan? Focus on better incident communication.

Resolving the incident is only half the battle when it comes to responding to incidents. For many teams, incident communication is an afterthought, leaving stakeholders inside and outside the organization guessing what happened. But ensuring that important information about the incident is disseminated clearly and quickly is essential.

Is being on-call a reason to quit?

“Well, that’s the job.” Have you ever heard that from your colleagues or bosses when it came to being on-call? Imagine you started a new job 3 months ago and were looking forward to it from the start. You are on-call one weekend a month and thought there wouldn’t be many incidents from Friday evening to Monday morning. But by now you’ve noticed how much on-call duty actually stresses you out. You get restless as soon as your shift starts.

Digital Transformation: Cloud Migration Strategy Checklist

Companies moving their applications and services to the cloud is nothing new, but doing business there requires a solid cloud migration strategy. The list of things to consider is longer than you might think. Fortunately, xMatters has done it successfully and has helped its own customers move to the cloud too. In this article, Product Marketing Manager Erin Jones gives a checklist you can use to get your cloud migration right.

RESOLVE '22: Customer experience in action

Companies go to extreme lengths to provide their customers the best possible experience—and every company’s concept of what makes a good experience is different. In our RESOLVE ’22 panel, "Customer experience in action," we sat down with Translation.com Director of Strategic Initiatives Sridevi Matukumalli and Akamai Senior Director Harish Menon to talk about this notion.

New features: multiple responders, escalation delay, IP filter for private status pages, edit uptime history

This post highlights some of the features and improvements that we have released in the last 3 months. If you want to submit your own ideas or vote on existing feature requests, you can now use our new public roadmap at roadmap.ilert.com.

RESOLVE '22: Observability and AIOps sitting in a tree

In our first session from RESOLVE ‘22, we were honored to have Darren Boyd and Satbir Sran from the Incubator podcast and ink8r think tank talk observability and AIOps with BigPanda’s Aaron Johnson. Both panelists are part of communities adopting open standards, and they regularly consult with organizations about how they can improve IT Operations and overall performance.

Fast track video series: Accept and normalize monitoring events

The Open Integration Manager enables you to create custom inbound alert integrations through the configuration of a generic, out-of-the-box inbound integration rather than creating custom code. This self-service and intuitive approach streamlines the integration process, accelerates data flow, and enables faster time to value.

Our journey to become Powerful Incident Management platform

Over the last couple of years, Spike.sh has largely been a Simple Incident Management Platform helping engineering teams across the world. Our focus on simplicity has been well received by all of you and we couldn't be more happy about it. After speaking with users earlier this year, we quickly realised there is a lot we can do to help our responders and help them better than we currently are.

What the heck is an incident?

Incident management is easily one of the most annoying things anyone ever has to deal with. There will always be only a handful of people who would ever want to walk into a burning building to put out the fire. That’s the same with most engineering teams. Only a handful are willing to get in, find the root cause, and mitigate the incident.

Fast track video series: Integrate ticketing and messaging tools with BigPanda

With distributed IT Operations becoming the norm, most enterprise teams struggle with communication and collaboration within and across the organization. Without the proper tools, staying on top of incidents can be challenging, quickly resulting in outages taking longer to resolve. The overall effect: an increase in downtime-related costs and a decrease in the performance and availability of services, making mean time to resolve (MTTR) worse.

Tips to make your Retrospectives Meaningful

If done right, retrospectives can help you inspect past actions, help adapt to future requirements and guide teams towards continuous improvement. However, organizations find it difficult to adopt the right mindset to execute retrospectives effectively. This blog will help you understand what retrospectives are and provide valuable tips to make your retrospectives meaningful. This blog will cover,

Deminar: Achieving Operational Resilience Through Service Intelligence

When an issue occurs, the potential risk of losing revenue, damaging customer relationships, and upsetting employees depends on an incident resolver’s ability to quickly restore services before the business is impacted. Watch our deminar to learn how a Digital Operations platform with Service Intelligence integrates everything you need so your organisation can visualise incidents in real-time, gain greater insight into their root cause, and remediate issues faster with service-centric automation.

Identify and manage impacted customers with our new Zendesk integration

Customer support tickets are a key indicator of which customers are being actively impacted by an incident. Incident-related support tickets are an important component of impact assessment, incident prioritization, and effective stakeholder communications. FireHydrant's new Zendesk integration allows Enterprise tier users to: With our Zendesk integration you can streamline customer impact assessments and incident communications, resulting in reduced support response times and incident durations.

Fast track video series: Extracting alert data from emails using BigPanda

BigPanda's easy-to-use self-service Email Parser receives information in email form and converts the data into BigPanda alerts. This is ideal for monitoring tools and systems that do not support a REST API: the Email Parser extracts alert data such as status and properties right from the email's subject or body without the need for custom code.

Building Workflows, Part 2 - the executor and evaluation

This is the second in a two-part series on how we built our workflow engine, and continues from Building workflows (part 1). Having covered core workflow concepts and a deep-dive into the Workflow Builder in part one, this post describes the workflow executor, and concludes the series with an evaluation of the project against our goals.

Introducing Webforms - Involve end users directly into your Incident Management process

Over the years we’ve received requests from our customers for a feature that can enable their customers and their end users to create or report incidents directly on Squadcast. To our valued customers - we heard you! We are excited to introduce Webforms to do exactly that. In the past, we’ve addressed the challenges pertaining to On-call processes and best practices that teams can implement.

What's difficult about problem detection? - Three Key Takeaways

Welcome to episode 4 of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.

What Makes a Perfect Incident Management Checklist? We Asked the Experts!

The perfect incident management checklist doesn’t need to be a fantasy. In fact, it shouldn’t be! The perfect incident management checklist should cover several topics, be broken down into bite-size sections, and help team members quickly identify tasks that fall under their responsibility. We asked our experts what should be included in the perfect incident management checklist. Here are their answers.

Building Workflows, Part 1 - Core concepts and the Workflow Builder

At incident.io, we’re building tools to help people respond to incidents, often by automating their organisations’ process. Much of this is powered by our Workflows product, which customers can use to achieve things like: Workflows as a product feature are incredibly powerful, and we’re proud of the value they provide to our customers. Behind-the-scenes, though, building something like workflows can be difficult.

How to drive better decision-making with reliability management

Almost every organization is going through digital transformation. According to IDC, direct digital transformation investment is growing globally at a compound annual growth rate of 15.5% and is expected to approach $6.8 trillion by 2023. Customers quickly embrace the benefits of a customer experience reshaped by technology. However, they have little patience when that technology doesn’t work as expected.

Managing Squadcast resources with our expanded Terraform provider

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, our Terraform provider was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

When Can A Service Not Be a Service?

If you’re familiar with PagerDuty, you probably associate it with alerts about technical services behaving in ways they shouldn’t. Maybe you yourself have been notified at some point that a service wasn’t available, was responding slowly, or was returning incorrect information. That’s the common use of a service in the PagerDuty platform.

Intro to Grafana Incident

In this video, you’ll learn how Grafana Incident offers a complete incident management process out of the box in Grafana Cloud, so you can save time and focus on what’s important when things go wrong. Grafana Incident is available to all free and paid Grafana Cloud users. If you’re not already using Grafana Cloud — the easiest way to get started with observability — sign up now for a free 14-day trial of Grafana Cloud Pro, with unlimited metrics, logs, traces, and users, long-term retention, and premium team collaboration features.

Blameless Expands Microsoft Partnership to Deliver Faster, More Intuitive Incident Response Collaboration

The world’s leading software engineering teams rely on Blameless during incident management. A key part of our offering is the ability to seamlessly integrate with a customer’s unique tech stack. As such, we value partnerships with companies like Microsoft that enhance our user experience and meet the needs of our customers. We understand how essential it is to integrate with communication tools like Microsoft Teams, because it’s the first place a user goes to start an incident.

Everbridge Signal - Open Source Threat Intelligence to Keep People Safe and Operations Running

There are billions of people online right now. Among that noise is information that could be vital to your organization’s safety and security. Everbridge Signal will help you find relevant information using Artificial Intelligence and Machine Learning. Detect incidents in real-time by gathering data from public sources including the dark web, deep web and social media. Whether your issues are cyber or physical, Signal can help.

How to Run a Post-Mortem Meeting: Tips, Tricks & Checklist

Meetings are a necessary evil in any workplace. They can be long, tedious, and often unproductive. But post-mortem (PM) meetings are different. They are one of the most valuable meetings a service-oriented organization can have. Post-mortem meetings are an essential part of any project manager's toolkit. They provide an opportunity to reflect on what went well and what could be improved upon in future projects.

PagerDuty Apps for AWS + Automated Diagnostics Demo

Reduce downtime and customer impact with service ownership while enabling teams to drive continuous improvement and innovation. Learn about how you can modernize and optimize your operations with our enterprise-grade set of AWS integrations. Automate incident response with PagerDuty’s Runbook Automation and learn about our new set of AWS plugins and prebuilt jobs that make it easier to get up and running with auto-diagnostics.

Upgrade your shopfloor alerting with Derdack

Over the last couple of months and service releases, we made continuous efforts to enhance Derdack's capabilities to collect, aggregate and alert on shopfloor incidents for our industry customers that primarily use OPC for alerting. In the accompanying projects, we made big improvements to our OPC integration and even added new features. The OPC integration received a complete overhaul of the configuration and data management systems and can now handle OPC UA Alarms & Conditions.

5 reasons why you shouldn't buy incident.io

Not many companies will tell you why you shouldn’t use their product, but any product that tries to be everything to everyone is doomed to failure. When you build without a specific user in mind, your target becomes the intersection of many viewpoints, and what you build is the lowest common denominator. What usually follows is software that can technically do everything, but feels unfocused, complex, and unpleasant to use. Something everyone is equally unhappy with.

Fast Track series: easily integrate monitoring alert sources

Integrating all of your monitoring alert sources is quite a task. Large enterprises often struggle to aggregate millions of data records from dozens of monitoring, change, and topology tools in real-time. Filtering out the noise and prioritizing the most important alerts are crucial to a team’s success. BigPanda makes it simple to integrate with any monitoring alert sources with Open Integration Hub. Currently, we have more than 50 easy-to-use integrations to choose from.

RIA Vendor Selection Matrix for AIOps 2022

In July, the research firm Research In Action (RIA) published the 2022 edition of their annual Vendor Selection Matrix™. Despite AIOps being a well-established technology (Moogsoft has customers who have been reaping the benefits of AIOps for many years), selecting a vendor can still be quite difficult, given the plethora of vendors who quickly re-branded their solutions as AIOps. So a vendor selection guide is a valuable resource.

Honeycomb Announces Major Updates to PagerDuty Integration

Today, we’re announcing major new updates to Honeycomb’s PagerDuty integration. These updates put more of the information you need into PagerDuty notifications and allow for greater configurability. These enhancements are available to all users who leverage Honeycomb Triggers and Burn Alerts to send notifications via PagerDuty.

New Feature: New Component Status Types

What’s just as important as resolving an impacted service? Providing detailed yet digestible updates to your communities and stakeholders. A recent update to StatusCast involves the addition of three new status types that can be assigned to your components. Detailed communication is an essential component of incident response and management, and additional status types provide your users with a more granular view of incident activity.

SignalFlows to SLOs

How are you tracking the long-term operation and health indicators for your micro and macro services? Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are prized (but sometimes “aspirational”) metrics for DevOps teams and ITOps analysts. Today we’ll see how we can leverage SignalFlow to put some SLO error budget tracking together (or easily spin up the same with Terraform)!

What you need to know & do to be a world-class cyber incident responder

World-class incident responders are a strategic asset in today’s world where the frequency and sophistication of cyber security attacks continue to increase every year, as do the associated financial damages: As such, more and more organizations are looking to grow their cyber incident response expertise, both with in-house staff and by engaging with third-party experts.

We're making our on-call calculator free

We've all done it: "that'll be simple, I'll just write a quick script and..." In the case of calculating on-call pay, we really have done it before: our team have built the on-call pay scripts for several companies, and each attempt was a painful, error-prone process. While we believe everyone on-call should be paid for their inconvenience, relying on someone's side-project or back-of-napkin maths to calculate pay leads to mistakes, frustration, and wasted time.

What is a Security Operation Center and how do SOC teams work?

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard IT environments from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify and prevent cyber threats to safeguard your IT environments. This blog covers the following pointers.

Why you need an incident timeline

We get it – incidents happen. What differentiates resilient teams from others is how they learn from them: using them as an opportunity to find the biggest improvements in how they work. Incident timelines are one of the simplest and most effective tools available to you when it comes to learning from an incident. It’s vital that you ensure they’re accurate and useful, in order to make the biggest improvements after an incident.

RESOLVE '22: Measuring what matters

Companies can take big strides toward “preventing preventable” incidents by minding what they measure. What’s in a name? In Measuring what matters, one of the panels at our RESOLVE ‘22 event, the three words in the title reflect a plan successful IT Ops teams have embraced to reduce the complexity of their reporting systems—resulting in a faster path for companies to make more effective use of all the IT resources at their disposal.

THWACK Livecast: Automating Your Way Beyond Simple Incident Management

Presented by: Kevin M. Sparenberg (KMSigma) and David Russell (david.russell.CSM). It’s time to take your service desk solution to the next level with automation rules. Built on the framework of simple rules, you can improve efficiency, refine standardized processes, and transform the way your organization runs. This THWACK© Livecast is all about how automation rules in SolarWinds© Service Desk can help lighten the load and allow your teams to focus on those big picture projects which actually improve the business. Let's stop getting bogged down in the minutiae and manual interaction of incident management and instead look at ways to lighten your load.