November 2021

Keeping people safe and operations running faster in Middle East healthcare

The COVID-19 pandemic has caused widespread concern in the interconnected international healthcare industry. This is especially true in the Middle East, where many of the world’s most renowned institutions have international operations. Healthcare leaders need to maintain a delicate operational equilibrium, balancing keeping their people safe and getting critical treatment to patients. This becomes more complex with healthcare practitioners in foreign locations, where they may not be familiar with the local norms or laws and need informative, real-time communications when a critical event occurs.

Observability and SaaS Providers

SaaS is exploding, and so it should; it takes commoditized work and infrastructure away from tech teams so that they can focus on differentiating features. But what happens when it goes wrong? How do SaaS platforms make sure they aren't letting their customers down and, in turn, letting their customers' customers down? Observability, bolstered with AI, gives all the partners the best chance to optimize availability and customer experience. Here's how.

6 Signs Your Incident Response Steps Are Working

Although IT incidents have always been a concern, the increase in customer-facing technology adds the cost of a bad customer experience to the cost of responding to and remediating an incident. While in a perfect world, you’d be able to prevent incidents from happening in the first place, the reality is they do happen and more often than most of us would like to admit.

PagerDuty at AWS re:Invent 2021-Deepening Our Collaboration with AWS

Across the globe, in-person technology events are beginning to emerge from their pandemic hibernation. For developers and DevOps teams, no event has been more anticipated than AWS re:Invent, which is back in Las Vegas, November 29th — December 3rd to help bring us all back together and slowly let us find our new normal. While handshakes may be replaced by elbow bumps or other newfound greeting rituals, we are excited to be back and see all of you in real life.

All About Incident Communication: What it Is, How to Do It, and Why It's Crucial for Business

No matter how much you try to avoid it, incidents are bound to happen. And while your first instinct is to resolve the issue, it shouldn’t be your only priority. By solely focusing on solving the problem and not communicating it to affected stakeholders, like team members and customers, you’re actively making the situation worse. In this article, we’ll discuss what incident communication is and how to create a strong incident communication plan.

Enterprise Alert 9.1 Update brings Microsoft Teams and SIGNL4 connectivity

As announced at the User Group Meeting 2021, we are now releasing Enterprise Alert 9.1. This version brings a set of new features extending the capabilities in some crucial areas. As always, you will find more details, release notes and downloadable installer files in the online user group. You can also watch the session from our UGM on YouTube.

Supervisors in xMatters - xMatters Support

Join Chris Patch, xMatters’ Senior eLearning Specialist, as he outlines the privileges of supervisors in xMatters. Supervisors can modify the user profile of any user they supervise, view their groups, change their login details, and sign out of the mobile app on their behalf if their account has been compromised.

Who Should Be On Your Incident Response Team?

When an incident strikes, an organization’s reputation and revenue, as well as customer trust, are at stake. Assembling an effective incident response team is critical to minimizing the incident’s impact. But what exactly is an incident response team? Who should be a part of the team and what are their responsibilities? Successful incident responses require a team with a diverse set of problem-solving and communication skills.

How AWS & xMatters Drive Monitoring and Observability Forward - xMatters Demo

Join Tiberiu Oprisiu, Solution Architect at AWS, Eric Maxwell, Solution Architect at xMatters, and Rutuja Rajwade, Partner Marketing Manager at xMatters, as they highlight and demo the benefits that come from pairing AWS with xMatters. Learn from Tiberiu which business imperatives drive observability, and what, why, and how AWS can do just this. And stick around to see Eric dive deep into xMatters Flow Designer to see how these workflows can be set up with ease!

A vital alerting solution

This article should give you a first idea of what SIGNL4 does. What do IT security, production monitoring and technical field service have in common? In all these scenarios, the right people need to get notified immediately – in case of technical malfunctions, urgent maintenance orders or emergencies, all in order to solve any incident quickly and efficiently.

How We Deploy Product Releases at xMatters

With Halloween behind us and the holiday shopping season fast approaching, engineering and product teams know what that means: code freezes! At xMatters, code freezes are a part of our product release process in anticipation of the busiest — and most important — time of the year for many of our customers. But code freezes are just one piece of the puzzle in how we ensure our customers have the most reliable experiences. The way our product releases are designed is much more than that.

Partner Integration - Dynatrace with PagerDuty and Rundeck

Deliver perfect software experiences with real-time intelligence into customer satisfaction and behavior, your applications, and the performance of your hybrid multi-cloud. AI-powered root-cause analysis automatically identifies customer-facing performance issues and pinpoints the root cause within seconds. Open APIs allow ingestion of third-party metrics and enable complex system integrations. In this demo, Rob Jahn shares a sophisticated incident remediation workflow incorporating intelligence from Dynatrace, automation in Rundeck, and incidents in PagerDuty.

SLA vs. SLO (Differences Explained)

Wondering about SLAs and SLOs? We explain service level agreements and service level objectives, their differences, and the importance of each. What are the major differences between service level agreements (SLAs) and service level objectives (SLOs)? An SLA is a legal agreement between the business and the customer that includes a reliability target and the consequences of failing to meet it. An SLO is an internal target that measures how customers use the service.
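
To make the distinction concrete, here is a minimal sketch of the arithmetic behind an availability target. The 99.9% SLO, the 99.5% SLA, and the 30-day window are illustrative assumptions, not values from the article, and `allowed_downtime_minutes` is a hypothetical helper. Teams often set the internal SLO stricter than the contractual SLA so that trouble is noticed before the contract is breached.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


# Hypothetical targets, chosen for illustration only.
SLO_TARGET = 0.999  # internal objective (assumed)
SLA_TARGET = 0.995  # contractual commitment (assumed)

slo_budget = allowed_downtime_minutes(SLO_TARGET)
sla_budget = allowed_downtime_minutes(SLA_TARGET)
print(f"SLO budget: {slo_budget:.1f} min, SLA budget: {sla_budget:.1f} min")
```

Under these assumed numbers, the gap between the two budgets is the margin a team has between missing its own objective and owing the customer the consequences written into the SLA.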

Building safe-by-default tools in our Go web application

At incident.io, we're acutely aware that we handle incredibly sensitive data on behalf of our customers. Moving fast and breaking things is all well and good, but keeping our customer data safe isn't something we can compromise on. We run incident.io as a multi-tenant application, which means we have a single database (and a single application).

4 Ways To Ensure Reliability of Your Digital Services for GivingTuesday

In today’s digital economy, seconds matter. For mission-driven organizations, seconds can be a matter of life and death, and service reliability can make or break access to suicide and safety hotlines, disaster relief, time-critical health care, food assistance, and more. That’s where real-time digital operations comes in.

DevOps Benefits & How to Maximize Them for Your Team

Curious about DevOps benefits? Whether you are just adopting DevOps or improving your current process, we explain the top benefits and how to maximize them. What are DevOps benefits? In DevOps, the operations and development teams work closely together during the entire software development lifecycle. The collaborative approach in DevOps leads to many benefits.

Sponsored Post

Using Predictive Analytics Capability to Resolve Critical Incidents

The CloudFabrix solution provides a holistic approach for enterprises to implement proactive operations, with the objective of eliminating or reducing critical incidents and improving customer satisfaction. The solution primarily relies on applying regression and forecasting models to any time-series data to detect and forecast anomalies. One of the unique features of the solution is the ability to convert unstructured data such as logs, incidents, and alerts into time-series data that can be used for running prediction models.
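
The general idea of turning unstructured events into a time series and then looking for anomalies can be sketched as follows. This is not CloudFabrix's implementation: the function names are hypothetical, and a simple rolling-window deviation check stands in for the regression and forecasting models the excerpt mentions.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev


def logs_to_series(timestamps, start, n_buckets, bucket_seconds=60):
    """Count events per fixed-width bucket, turning log timestamps into a time series."""
    counts = Counter()
    for ts in timestamps:
        idx = int((ts - start).total_seconds() // bucket_seconds)
        if 0 <= idx < n_buckets:  # ignore events outside the observed range
            counts[idx] += 1
    return [counts.get(i, 0) for i in range(n_buckets)]


def flag_anomalies(series, window=5, threshold=3.0):
    """Flag indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        # A floor of 1.0 on the deviation avoids flagging noise on a flat history.
        if abs(series[i] - mu) > threshold * max(sigma, 1.0):
            anomalies.append(i)
    return anomalies
```

A real system would likely replace `flag_anomalies` with a fitted forecasting model and compare each new point against the forecast rather than a trailing mean.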

Growing pains: the IT Ops maturity model

Modern IT Ops environments have many moving parts that need to work together well, yet are evolving at different speeds. This gap in maturity creates many problems. In this CTO Perspective, Jason Walker, Chief Customer Officer at BigPanda, discusses why IT Ops teams should prioritize maintaining a common maturity across all their IT operations, and how best to do that.

Deploying to production in <5m with our hosted container builder

Fast build times are great, which is why we aim for less than 5m between merging a PR and getting it into production. Not only is waiting on builds a waste of developer time — and an annoying concentration breaker — the speed at which you can deploy new changes has an impact on your shipping velocity. Put simply, you can ship faster and with more confidence when deploying a follow-up fix is a simple, quick change.

Training Intelligent Alert Grouping

Complex incidents are both exhausting and commonplace. In this case, incidents that I am referring to as “complex” are incidents that involve multiple, disparate notifications in your alert management platform. Perhaps these incidents are logically separated because the underlying systems or services were seen as less coupled than they turned out to be in reality.

Fail-Safe Digital Scheduler for On-Call Management

In this video, we discuss how OnPage's advanced, fail-proof digital schedules enable organizations to distribute workload evenly among scheduled, On-Call team members. The OnPage scheduler starts out "FULL" and schedules are created on top of it. This guarantees that a notification is delivered reliably, even when a slot is left empty on the scheduler. The scheduler reverts to the default group order and the entire group is notified, ensuring continuous coverage across your organization.
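
The fallback behavior described above can be sketched as follows. This is an illustration of the idea only, not OnPage's API: `Shift`, `OnCallGroup`, and `recipients` are hypothetical names, and the sketch assumes that an empty slot means nobody is scheduled at the moment the alert arrives.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Shift:
    start: datetime
    end: datetime
    responder: str


@dataclass
class OnCallGroup:
    members: List[str]  # default group order (assumed to be the fallback)
    shifts: List[Shift] = field(default_factory=list)

    def recipients(self, now: datetime) -> List[str]:
        """Return who should be notified at `now`."""
        scheduled = [s.responder for s in self.shifts if s.start <= now < s.end]
        # Empty slot: fall back to the entire group in its default order,
        # so the notification always has at least one recipient.
        return scheduled if scheduled else list(self.members)
```

Starting from "everyone is notified" and layering schedules on top means a forgotten slot degrades to a noisier alert rather than a missed one.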

Tis The Season: Protect Your Availability During The Holidays

Deck the halls! It's time for the annual holiday Code Freeze, that festive time of year when businesses impose a precautionary halt to code changes and Operations should be quiet. But before you kick up your feet, make sure that demand doesn’t lead to availability embarrassments. After all, retail experts suggest that we’re in for another online-heavy holiday shopping season, so businesses need to brace for increased digital traffic...with little tolerance for failure.

Partner Integration on Twitch: Lacework

Lacework delivers complete security and compliance for the cloud. While the cloud enables enterprises to automatically scale workloads, deploy faster, and build freely, it also makes it increasingly difficult to maintain visibility, remain compliant, stay free from known vulnerabilities, and track activity in both host workloads and ephemeral infrastructure within their environments. Integrate Lacework with PagerDuty to route Lacework Events to responders on your team. Manage and resolve configuration issues, behavioral anomalies, and compliance requirements in a timely manner across your cloud infrastructure.

How to Write Meaningful Retrospectives

One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are 7 main elements to a retrospective. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions.

5 ways incidents made me a better engineer

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.

Fall 2021 Launch: Automate Incident Response to Accelerate Critical Work

Modern businesses are digital businesses—so managing your business means mastering your critical services and operations for your employees and customers. Today, you need to be able to understand every aspect of your company—as it unfolds—because in this world, seconds matter to your productivity, your revenue, and most importantly, your customers.

Achieving Operational Resilience for Cellular Carriers

The world is changing, and with great change comes an evolving threat landscape. Increases in physical and digital disruption, such as civil unrest, cyberattacks, severe weather events, and unplanned outages, have left many industries scrambling to secure a robust operational resilience strategy, including the cellular industry. Today’s evolving threat landscape poses a unique threat to cellular carriers, whose business is growing at a breakneck pace.

Mobile Service Dispatching for In Plant Transport Logistics at BASF Coatings

BASF is the largest chemical producer in the world with a revenue of EUR 59bn, 247 manufacturing sites and 110,000 employees. BASF’s Coatings division employs 11,000 people and develops, produces and markets innovative solutions for automotive OEM and automotive refinish coatings and industrial coatings as well as architectural coatings and related coating processes.

IT Failures are Inevitable

As infrastructure stacks grow increasingly complex and involve an ever-growing number of services, system failures are becoming more and more common. There can be a variety of reasons why systems fail: software bugs, misconfiguration, interactions between services that cause unexpected behavior, network outages, and, of course, those rare occasions where natural events render data centers inoperative.

Sponsored Post

Your Guide to Developing a Fail-Safe Incident Response Plan

Incidents happen. Every organization's technical team will face an incident sooner or later, whether planned or unplanned. An incident can be declared or initiated in response to an event or combination of events that affects the integrity or availability of a system or service in a way that impacts core business processes.

Minimize the impact of critical incidents with Freshservice On-Call Management

“Service outage! Help!” These words (or their variations) have preceded notable losses of millions and billions of dollars in the 21st century. From large corporations to SMBs, no one is immune to the effects of downtime, whether planned or unplanned. However, the earlier an issue is noticed, the faster it is acted upon and resolved, resulting in little or no customer impact.

Monitoring & Observability for Sales, Marketing and Business ops teams with StackMoxie and PagerDuty

Before Stack Moxie, every business ops team needed PagerDuty, but finding and pushing errors was a manual process. With Stack Moxie + PagerDuty, every business ops professional can manage their sales, marketing, HR or customer success stack with the same quality engineers bring to code.

OnPage's Clinical Communication and Collaboration Solution

Modern healthcare teams require a modern solution to streamline clinical communications and medical workflows. In life and death situations, it’s critical that physicians receive immediate alerts and messages to provide patient care promptly. OnPage is the industry’s most trusted clinical communications platform. OnPage is more reliable and secure than traditional pagers. The system enables care teams to easily communicate and achieve maximum patient satisfaction.

4 IT Challenges Addressed by OnPage Automated Alerting

IT organizations are challenged with delivering quick, effective resolution to customers’ database, hardware or software downtime issues. Contractually binding service-level agreements (SLAs) place further pressure on IT engineers to accelerate incident resolution time and minimize downtime. Though engineers are obligated to meet their SLAs, they are unable to do so without the help of an automated alerting system.

Logs and tracing: not just for production, local development too

We're a small team of engineers right now, but each engineer has experience working at companies that invested heavily in observability. While we can't afford months of time dedicated to our tooling, we want to come as close as possible to what we know is good, while running as little as we can, ideally buying, not building. Even with these constraints, we've been surprised at just how good we've managed to get our setup.

Avoid frostbite: Stop doing code freezes

As the holiday season aggressively approaches, I want to make a public service announcement for everyone toying with the idea of a code freeze for the holidays: please don't. It’s getting cold outside and the season of peppermint mochas is upon us, which might get you thinking about putting a code freeze in place for the holidays. A word of warning: instituting a code freeze may have unintended consequences.

Outage or Breach - Confront with Confidence (2021)

A recent Dice article titled “Data Breach Costs: Calculating the Losses” referenced a 2021 IBM and Ponemon Institute study that looked at nearly 525 organizations in 17 countries and regions that sustained a breach last year, and found that the average cost of a data breach in 2020 stood at $3.86 million.

Reliable incident alerting for critical IT systems at German health insurance provider Debeka

“Thanks to Enterprise Alert and the acknowledgement function, we can track the alerting and response digitally and have the certainty that our employees always take care of incidents in our critical IT infrastructure in a timely manner. IT alerting with Derdack, which has to be documented according to BaFin KRITIS, is highly reliable.” (Markus Reusch, Product Owner Monitoring, Debeka)

How to improve your influence as an SRE

Improving your influence over the company will help you deliver high-quality work, as your goals will be closely aligned with those of the company. In this blog piece, Ricardo explains how to improve your influence as an SRE. Balancing fast-paced business requirements with the demands of keeping production services stable is not an easy task.

Playbooks in Action: Creating Effective, Repeatable Incident Resolution Workflows

While service incidents can be wildly dissimilar, they tend to have one thing in common: a need for quick resolution. Response teams need a robust, repeatable process to follow that ensures fast, mistake-free execution, especially for those 4 AM calls. Having a documented checklist saved where the entire team can access and use it at any time could make the difference between quick resolution and compounding the problem.

4 Recommendations for Optimizing DevOps

The concept and development of DevOps have significantly changed the way IT teams work in the last decade. Small and large teams alike can see the difference when they switch from traditional software development cycles to a DevOps cycle: accelerated innovation, improved collaboration, faster time to market. And the list of benefits continues to grow. To effectively embrace DevOps, however, is not an easy task. Thankfully, there are ways to navigate this challenging journey.

Announcing Grafana OnCall, the easiest way to do on-call management

A critical part of managing modern software development is setting up and running an on-call rotation. But that often involves significant toil, in part because many of the existing tools are cumbersome and not developer-friendly. That’s why we’re excited to announce Grafana OnCall, an easy-to-use on-call management tool that will help reduce toil in on-call management through simpler workflows and interfaces tailored for devs.

Now you see me, now you don't: feature-flagging with LaunchDarkly at incident.io

At incident.io, we ship fast. We're talking multiple times a day, every day (yes, including Fridays). Once I merge a pull request (PR), my changes rocket their way into production without me lifting a finger. 💅 It's when we tackle larger projects that this becomes a bit more complicated. We recently launched Announcement Rules, which let you configure which channels incident announcements are posted in depending on criteria you define.

Your Ops and DevOps teams need to work together, and fast. Who you gonna call?

The world is moving fast, led by an ever-accelerating IT landscape. In recent years, two distinct types of teams have emerged that assist in driving this business transformation: DevOps/SRE teams that are in charge of driving rapid innovation of products and services, and IT Ops/NOC teams that focus on preventing outages and maintaining the high level of quality, reliability and serviceability that modern, discerning customers expect.

How Playbooks improve customer service delivery, agent productivity

We all know one bad experience can impact a customer’s perception of—and even willingness to deal with—an organization going forward. That’s why so many companies, in virtually every industry, have made investing in customer experience (CX) a top priority, according to ResearchAndMarkets.com. The problem is, for any given organization, there are a number of customer service processes along the entire life span of an interaction that need to be looked at and made great.

New Apps for PagerDuty's Datadog Integration

Status Dashboard by PagerDuty and Incidents by PagerDuty are new apps available now in Datadog. See a live, shared view of system health to improve awareness of operational issues with Status Dashboard by PagerDuty. Acknowledge, troubleshoot, and resolve incidents with PagerDuty actions embedded directly in the Datadog interface to limit context switching among tools. Julia Nasser and Hadijah Creary join the stream to show off this powerful enhanced integration.

Make sense of complex systems with Dynamic Service Graph by PagerDuty

The Dynamic Service Graph breaks down silos between teams and provides organizations with a living, breathing asset that displays technical and business services and their relationships at scale. It allows teams to quickly grasp the state of services, visually digest the full impact radius of an issue, zero in on likely cause, and seamlessly facilitate cross-team collaboration.

Leaning on Technology in The New Noisy: Managing Cloud, Change and Risk

Your company’s “digital transformation” will be driven by new application designs and methods, new technology stacks, and new processes. To master it, and to deliver next-generation services through it, massively complex sets of signals and data need to be leveraged, processed, and acted on. Developers need integrated data and insights through that noise, while being able to leverage their tools of choice. All of this must be managed, even in spite of massive rates of change and innovation.

Visualize and manage all of your services in one place with Dynamic Service Graph

In this digital era, technology systems are becoming increasingly complex. No longer can a single SME (subject matter expert) understand every facet of the system they run. Instead, much of this knowledge is siloed and exists as tribal knowledge within certain teams. Additionally, the rate of change is faster than ever, with code deploying and new services shipping at a rate unimaginable a few years ago.

What's New in the PagerDuty Terraform Provider - PagerDuty Garage (Oct 29, 2021)

The Terraform PagerDuty provider is a plugin for Terraform that allows for the management of PagerDuty resources using HCL (HashiCorp Configuration Language). Manage your PagerDuty account with Infrastructure as Code. For more info on the PagerDuty provider for Terraform, see the documentation on the Terraform Registry.

How they SRE: Insights from the Cloudflare SRE team

Cloudflare is a global cloud services provider with offices around the world, from San Francisco, US to London, England to Sydney, Australia. Their mission, as stated front and center on their homepage, is to help build a better Internet. While that may read like hyperbole, their numbers are impressive: Cloudflare has over 126,000 paying customers, and 95% of Internet users in the developed world are within 50ms of their network.

OnPage Integrates With Single Sign-On Solutions to Improve Secure Authentication

WALTHAM, Mass., Nov. 3, 2021 — OnPage Corporation, a Boston-based incident management company, today announced the availability of new integrations with leading single sign-on (SSO) solutions Okta and OneLogin. The latest integrations allow for a secure authentication process when users log in to the OnPage system using their SSO account credentials.

November 2021 Update - Improved incident response with team escalation and more

Our November update introduces new team settings and, along with them, entirely new options for escalating Signls. This will allow you to make your incident response even more reliable. One application is to create a ‘managers on duty’ team with full duty scheduling capabilities and escalate missed Signls to such a second-level response team. As always, you can find all the details in this article.

How UK Healthcare Reduced Incident Response Times from Minutes to Seconds - xMatters Demo

When there's a high severity incident at a hospital, it could be a life or death situation. So how do you get in contact with the right doctors and clinicians in such a busy environment when tensions are running high? Join Glenn Steketee, Technology Service Analyst at UK Healthcare, Sonu Sekhon, Customer Success Manager at xMatters, and Will Derksen, Product Advocate at xMatters, to discuss how xMatters reduced incident response times from minutes to seconds.

Unlocking Climate Change Resilience Through Critical Event Management and Public Warning

Across the globe, both public and private sectors are more concerned than ever about addressing climate change and its associated risks. “In the period 2000 to 2019, there were 7,348 major recorded disaster events claiming 1.23 million lives, affecting 4.2 billion people (many on more than one occasion) resulting in approximately US$2.97 trillion in global economic losses,” according to a report by the UN Office for Disaster Risk Reduction (UNDRR).

What's New in xMatters: The Ninja Release

Get ready for something exciting coming your way! xMatters' latest release, Ninja, is on the horizon and will be available in production next week. Named in honor of the classic video game Ninja Gaiden, this latest batch of xMatters updates is sure to pack a punch (pun definitely intended). This release rolls out exciting new features like an intelligent Service Dependencies map and integrations with the broader Everbridge platform, among many other things.