January 2023

Analytics in Squadcast | Visualize Team and Organization Level Analytics | MTTA MTTR | Squadcast

Analyzing incident data is key to practicing SRE well. Squadcast's Analytics Dashboard helps you analyze the performance of your Organization or Team over a given time period. It also gives you more insight into past outages that affected your systems.
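As a rough illustration of the kind of metrics named in the title, the sketch below computes MTTA and MTTR from a handful of incident timestamps. It assumes the common definitions (mean time from trigger to acknowledgment, and mean time from trigger to resolution); the incident records and the `mean_minutes` helper are hypothetical and do not reflect Squadcast's own data model or API.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2023, 1, 3, 10, 0), datetime(2023, 1, 3, 10, 4), datetime(2023, 1, 3, 10, 30)),
    (datetime(2023, 1, 9, 22, 15), datetime(2023, 1, 9, 22, 21), datetime(2023, 1, 9, 23, 45)),
]


def mean_minutes(pairs):
    """Average gap in minutes between (start, end) timestamp pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


# Assumed definitions: MTTA = trigger -> acknowledge, MTTR = trigger -> resolve.
mtta = mean_minutes((triggered, acked) for triggered, acked, _ in incidents)
mttr = mean_minutes((triggered, resolved) for triggered, _, resolved in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Filtering the records by team or date range before averaging would give the per-team and per-period views that the dashboard description mentions.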

OnPage - Never Miss a Critical Alert Again (For IT, Clinical Comm. and Collab. & Crisis Comm.)

OnPage is an Incident Alert Management platform that elevates critical notifications to the right person on call to remediate critical events. With Alert-Until-Read capabilities, dynamic digital schedules, escalation policies, incident reports, and redundancies, OnPage aims to ensure that critical alerts are never missed. OnPage serves many industries, including healthcare, information technology, managed services, IoT, and manufacturing. With 250+ integrations, the solution extends incident alert management to popular ITSM (ticketing), RMM, monitoring, and cybersecurity tools. On the healthcare front, OnPage integrates with popular scheduling, IoT, nurse call, and EMR systems.

A Complete Guide to PagerDuty Alternatives

Exploring Options for Incident Management: A Comparison of PagerDuty and Other Tools. Effective incident response is crucial for managing and resolving operational issues in a complex technology environment. As systems built from numerous services grow more complex, companies need a way to keep them running smoothly.

What's New: January 2023

We’re excited to announce a new set of updates and enhancements to the PagerDuty Operations Cloud. Recent development and app updates from the product team include Incident Response, PagerDuty® Process Automation, the PagerDuty Mobile App, Integrations, as well as Community & Advocacy Events updates. We continue to help customers automate further to optimize cloud operations and reduce the number of issues escalated to other teams. Get started now and learn about.

Sponsored Post

What are Network Operation Centers (NOC) and how do NOC teams work?

Modern-day markets are highly competitive and in order to foster stronger customer relations, we see businesses striving hard to be always available and operational. Hence, businesses invest heavily to ensure higher uptime and to have dedicated teams that constantly monitor the performance of an organization's IT resources. In this blog, we will explore what NOC teams are and why they are important.

Create a Status Update Notification Template in less than 2 minutes

Now generally available! With organization-based templates, companies can now customize and standardize communications based on impact, service areas, and more. This functionality will also be available via API, so teams can customize and leverage status update notification templates to fit their needs in any tool or context.

What are AIOps use cases?

The past decade has seen organizations embrace AI and data analytics at scale. In 2022, IBM found that 35% of organizations have embraced AI—a 4% increase from 2021. The trend of AI adoption will continue to play out in the next several years across virtually every organizational function. At the vanguard of this movement is AIOps, which sees AI used to improve IT operations (ITOps).

What Is IT Mapping and How Can it Prevent the Next Production Incident?

IT infrastructure mapping is the process of creating a visual topology of a network infrastructure. This mapping process helps you understand the geographic and interactive layout of the network that applications depend on. Using infrastructure mapping for troubleshooting, you can quickly understand the relationship between application issues and hardware issues.

Create Better UX with Incident Response and Service Intelligence

Incidents that impact user experience are some of the most common challenges that IT, security, and operations teams must face. Users have high expectations for application uptime, and organizations are responsible for ensuring applications are available for them. From application performance to user interface design, many factors can affect a customer’s experience—and resulting confidence—in your product’s capabilities.

Say Goodbye to the 'Executive Swoop and Poop' with Status Update Notification Templates

Incidents are unpredictable, but how you share updates with stakeholders doesn’t have to be. Status Update Notification Templates help teams streamline communication with internal stakeholders during a major incident. We are excited to announce that this feature has added new capabilities.

DataScan transforms incident response & business continuity tests

With more than $80 billion of loan collateral in its systems, DataScan is an industry leader in providing solutions for wholesale asset financing and inventory risk management. The company’s InfoSec leadership understood that it needed to take a whole new approach to incident response and to advance the company’s security maturity. Having multiple tools for managing incidents and conducting business was translating into inefficiencies, prolonged resolutions, and stress.

5 Exciting Predictions for SRE in 2023

SRE is a field defined by its constant evolution: from Google’s in-house secret recipe, to the hottest new practice for the biggest enterprise orgs, to a diverse and holistic mentality practiced by orgs of all sizes. Earlier this year, we co-sponsored the Catchpoint State of SRE survey, where we took the temperature of SRE where it was. Now, as we did in 2021 and 2020, we’ll turn to the future to speculate on what 2023 will bring for SRE.

Sponsored Post

Using AIOps for Better Adaptive Incident Management

An effective incident management strategy is crucial for any business, especially those offering consumer-facing digital services. This is because when incidents occur, they may be easily detected by your users, impact your reputation, and ultimately affect your bottom line. So, to minimize the reach and severity of incidents, your response needs to be swift and effective. One way to ensure your approach meets these requirements is to implement AIOps.

Sponsored Post

Runbook Automation as a Baseline for Controllability and Observability

Some of the highest priorities for engineers, from NOC Engineers to DevOps and Site Reliability Engineers, are the automation and optimization of their production environments. Many companies today face tough challenges with their Network Operations Centers (NOCs) or production environments. These challenges fall into the hands of engineering teams.

ITIL vs. ITSM - What's the difference?

Companies depend on IT services to support their business operations, and to meet the demands of their customers. ITIL (Information Technology Infrastructure Library) and ITSM (Information Technology Service Management) are frameworks to help organizations manage their IT services. While these two do have elements in common, they also have important differences. ITIL is a set of best practices for IT service management which emphasizes the alignment of IT with the needs of the business.

How we approach integrations at incident.io

If you pick a random SaaS company out of a jar and go to their website, chances are they integrate with another tool. Typically, the end goal of integrations is to meet users in the middle by working with other tools they’re already using day to day. Put another way, integrations are a strategic business decision. But the question remains: why don’t companies just build a tool with similar functionality in order to make the product stickier?

The Risks Of Using Small Status Page Vendors

Servers are down. Employees are scrambling. Customers are upset. The pressure is on. When internal operations are in disarray, and your business is experiencing a service outage, the last thing you need to worry about is the reliability of your incident communication solution. Keeping users informed when services are down is mission-critical, in order to prevent a flood of support requests, which compound the effects of the incident, straining employee productivity and bandwidth.

PagerDuty and Fiberplane Integration Demo

Presenter: Aparna Valsala, Solutions Engineer at Fiberplane. Using the PagerDuty and Fiberplane integration, the responding engineer can immediately start the investigation from a predefined, configurable Fiberplane template that is visible to all, while multiple engineers collaborate on the investigation with complete visibility and context.

Causes of Data Center Outages and How to Overcome Them

With the increasing computing requirements and complexity of data center systems, unplanned downtime has become a severe threat to enterprises in terms of process violations, revenue losses, and reputational issues. Although data center failures are quite common, it can be difficult to predict every scenario that might have a severe impact on the expansion of your company. Especially when some factors, like a natural disaster, can simply be beyond your control and result in data center outages.

APIs' Impact on DevOps: Exploring APIs' Continuous Evolution

An application programming interface (API) is a set of rules and protocols that enables different software applications to communicate and share data and functionality. The concept of an API has been around for a long time. However, APIs as you know them emerged in the late 1990s and early 2000s with the rise of the internet and web-based services. As more businesses began to offer online services, the need for a standardized way for these services to interact and share data became apparent.

How to talk to your executive leadership team about reliability

Product reliability requires investment from all areas of the business. Technology leaders must effectively communicate the implications of service reliability to the rest of the organization. As a leader, how do you prove that a more reliable product is critical to success? Experts from BetterCloud, Machinify and Blameless come together to discuss how to talk to your executive leadership team about reliability in this webinar.

The Inevitable - Failures in Distributed Systems

Experiencing failure at scale is, as the popular Marvel character Thanos would say, “Inevitable”. Memory leaks and software, hardware, or network I/O failures are just a few examples. It’s a problem of simple mathematics: the probability of failure rises as the total number of operations performed increases. With each component used to scale the application, the failure quotient increases. So how do you tackle this so-called “Inevitable” problem that comes with scaling?
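The "simple mathematics" can be made concrete with a small sketch. It assumes, purely for illustration, that every operation fails independently with the same probability p, so the chance of seeing at least one failure across n operations is 1 - (1 - p)^n. The function name and the sample values are hypothetical and are not taken from the article.

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability of at least one failure in n independent operations,
    each failing with probability p (an illustrative assumption)."""
    return 1.0 - (1.0 - p) ** n


# A one-in-a-million failure rate looks negligible for a single operation,
# but the chance of hitting at least one failure grows with volume.
for n in (1, 1_000, 1_000_000, 10_000_000):
    print(f"n={n:>10,}  P(at least one failure)={p_any_failure(1e-6, n):.6f}")
```

Real failures are rarely independent or uniform, so this is a lower-fidelity model than a production system deserves, but it shows why scale alone pushes the probability of some failure toward certainty.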

IT Workflow Explanation

IT Workflow Automation automates the execution of IT tasks and processes. This can include everything from provisioning new servers and deploying software updates to monitoring and troubleshooting IT systems. Workflow automation helps organizations reduce the time and effort required to perform these tasks by automating manual processes and eliminating the need for manual intervention. It can also improve the accuracy and consistency of these processes, as there is less room for human error.

10 Points of consideration for investing in an Observability Platform for your organization.

10 Points of consideration for investing in an Observability Platform for your organization. Scalability: Can the observability platform handle the volume of data that your organization generates? Compatibility: Is the observability platform compatible with your organization's existing systems and technologies? Ease of use: Is the observability platform user-friendly and easy for your team to adopt and use?

[PODCAST] Episode 1 Season 2; How to successfully build and defend your 2023 ITOps budget

It’s that time of year when ITOps leaders quantify their plans in budgets that must compete with other equally hungry groups for limited corporate resources. How can the thankless task of proactively preventing outages and speeding time to resolution win against funding flashier projects? Real-world facts can make that difference. One of the major topics Nigel and Craig will discuss is how to help organizations successfully build and defend their 2023 ITOps budget for investments in tooling, headcount, and workflow improvements.

PagerDuty Status Pages Enable Real-Time, Proactive Customer Communication During Incidents

Integrated, Intuitive Feature Saves Time and Money, Aligning Technical and Customer-Facing Teams, Allowing Further Consolidation on to the PagerDuty Platform, and Building Customer Trust During Large-Scale Events.

5 best incident management tools of 2023

Put simply, managing incidents, big or small, is good for business. Not only is it a regulatory requirement, but also a factor in your profits. Your customers expect smooth operations, good customer service and protection. A dedicated incident management tool can help protect all of these. While many may think of incidents as an IT or DevOps issue, it’s hard to overemphasize that they can happen in any department.

Incident Management Tools - Do I Even Need Them?

Software is hard… Maintaining software reliability is harder than it used to be. Software systems have grown dramatically in complexity as they’re applied in a wider range of applications and environments, many of which have become fundamental to the everyday function of our society. On the other hand, the pace of software development and release is also faster than ever. Innovating new features faster than competitors has become the key to success in a rapidly changing market.

Managing incidents in a growing organisation - incident.fm

In this week's episode, we're joined by Matt Huxtable, CTO at Ziglu (an e-money issuer, offering a variety of digital finance services, particularly well known for its cryptocurrency services). Matt talks about how the engineering team at Ziglu has evolved over time, building an agile culture and why "keep it boring" is his mantra. Chris, Pete and Matt cover how to context switch between solving and communicating during an incident, their most creative incident fixes and why AI isn't ready to solve incidents for us just yet.

Easy to manage fine-grained access control and roles

A neatly set up access control scheme that specifies exactly what each user can do on an incident management platform can save a lot of time and hassle in the future. In the past, Spike.sh had only 2 roles: Admin and Member. The only difference between these roles was that only Admins could remove members. It was fairly simple and most users liked it. However, with larger teams coming onboard, it became a little difficult for admins to control. So, we have extended the existing system by adding two more roles.
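A role-to-permission mapping of this kind can be sketched in a few lines. The example below models only the original two roles described above; the two newer roles are not named in this excerpt, so they are left out. The permission names and the `can` helper are hypothetical, and this is not Spike.sh's implementation.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    MEMBER = "member"


# Hypothetical permission names. Only "remove_member" reflects the rule
# described in the text (Admins alone can remove members).
PERMISSIONS = {
    Role.ADMIN: {"view_incident", "acknowledge_incident", "remove_member"},
    Role.MEMBER: {"view_incident", "acknowledge_incident"},
}


def can(role: Role, action: str) -> bool:
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())


print(can(Role.ADMIN, "remove_member"))
print(can(Role.MEMBER, "remove_member"))
```

Keeping permissions in a single table means a new role is one more entry in the mapping, with no extra branching in the code that checks access.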

Need your own incident post-mortem template? Here's ours

Having a dedicated incident post-mortem is just as important as having a robust incident response plan. The post-mortem is key to understanding exactly what went wrong, why it happened in the first place, and what you can do to avoid it in the future. It’s an essential document but many organizations either haphazardly put together post-incident notes that live in disparate places or don’t know where to start in creating their own post-mortems.

CDI's evolution with BigPanda: from partner to customer

CDI’s partnership with BigPanda has catapulted them to the forefront of modern IT operations. Through reselling and implementing BigPanda’s technology for customers, CDI saw the remarkable value of the platform and began to integrate it into their own business. In the process, they’ve become a partner and a customer—leveraging the product to transform their own operations in ways that previously seemed unimaginable.

Introducing PagerDuty Status Pages for Improved Customer Communication and Savings

In 2023, the fight to retain customers will be one of the biggest factors determining whether a business can survive the recession all are predicting. One of the key findings from the 2022 State of Service Report from Salesforce is that great service is at the heart of customer retention: 48% of customers will switch brands for better customer service when something goes wrong, and they view open communication as a key factor in how a customer might gauge the quality of customer service.

What is incident management? Maximize uptime and minimize disruptions with ServiceDesk Plus

Incident management is the process of restoring IT services to normalcy as quickly as possible. You can check out our comprehensive guide on incident management to learn more about how you can implement incident management best practices in your organization.

Lessons from the CircleCI Security Incident

In some respects, security and reliability are competing priorities. Security controls may reduce reliability, and responding to security incidents may require mission-critical systems to be paused or shut down until they're secure. The recent security incident involving CircleCI, however, shows that it's not always necessary to choose between prioritizing security or reliability.

2022 BigPanda product year in review

The start of a new year often includes reflecting on what you accomplished over the past year and setting new goals for the year ahead. In 2022, BigPanda set big goals to help organizations prevent and resolve IT and service outages through our innovative Incident Intelligence and Automation platform, powered by AIOps. On average, our customers sent us 2.3 billion events and changes per month, with our largest customers by volume sending us approximately 165 million events each.

Critical Metrics and Alerts in the Continuous Delivery Process

Continuous delivery is a software development approach in which code changes are automatically staged for production release. A foundation for modern application development, continuous delivery extends continuous integration by automatically deploying code changes to test and production environments after the build phase. When properly implemented, developers have deployable build artifacts that have passed a standardized testing process and can be deployed to environments as needed.

Playbooks: A new superpower for designers

From one designer to another, you should know why Playbooks is a fantastic addition to your design tool belt. Playbooks was designed with technical workflows in mind, from incident response to release management, but its flexibility makes it a perfect fit for any repeated process. I love it for creating reusable templates of design checklists, and it is an excellent way to do design review sign-off.

Failure Analysis: Engineering incidents are a bigger problem than you think

Engineering incidents can be quite harmful for companies, both in terms of financial costs and reputational damage. In some cases, engineering incidents can even put people's lives at risk, which can have serious legal and moral implications for the company involved.

How communication can make or break your incidents - incident.fm

In this episode, Pete and Lisa discuss why great communication is essential to the success of any incident management process. From keeping your wider team in the loop to minimise disruption, to using customer communication to strengthen your brand when things go wrong, the team share their experiences and top tips for having a transparent incident communication culture.

PagerTree Broadcasts

PagerTree broadcasts are a great way to send mass messages to multiple teams or users (think of an all-hands-on-deck situation). When using the broadcasts feature, you can send one-way messages and optionally request a response. PagerTree intelligent on-call alert routing gives teams flexible schedules, escalations, & reliable notifications via email, SMS, voice, chatbots, & smartphone app.

How to Avoid Common Software Deployment Challenges

Software deployment is the manual or automated process of making software available to its intended users. It’s often the final, and most important, stage in the Software Development Lifecycle (SDLC). Software deployment is a three-stage process. All software deployments pose challenges, and issues can arise in any of the three stages.

The State of AIOps: A New Years' Message from Chief Moo Phil Tee

Well, that was fast! Another year has come and gone. It is safe to say 2020, ‘21 and ‘22 were exceptional, and only sometimes for good reasons. But I take heart in society’s steady progress toward digital maturity through it all. Nearly 100% of IT leaders say the pandemic accelerated their organization’s rate of digital transformation.

How JPMorgan Chase uses Grafana and AI to monitor SLOs, SLIs, and more

For the team at JPMorgan Chase, the daily stakes of having a stable system are high. “We are in the business of making sure that trades are executed, and systems are stable and up and running for a positive client experience,” said Askari Imam, VP, Asset Wealth Management (Product and Integration Delivery).

A better way: 3 incident response areas prime for automation

By automating some rote parts of incident response, you reduce decision fatigue and help responders get to solving the problem faster with less stress. In this post, we talk about three areas of the incident response process that are prime for automation.

Identify and resolve incidents faster with InsightFinder's offering in the Datadog Marketplace

InsightFinder is a SaaS platform that uses AI-backed predictive analytics to predict and prevent production incidents. Using InsightFinder with Datadog, you can quickly identify hidden correlations in your application metrics, logs, and events and address application issues before they devolve into production outages and create customer impact.

Gartner IOCS Blog - Lucid Motors Case Study

Assaf Resnick, CEO and co-founder of BigPanda, sat down with Sanjay Chandra, vice president of information technology at luxury electric automaker Lucid Motors, at Gartner IT IOCS 2022. They discussed Lucid’s unique ITOps journey and how BigPanda helps minimize downtime of critical applications and services. Sanjay is a visionary ITOps leader, responsible for IT, enterprise systems, global infrastructure, operations and security at Lucid Motors.

What is Automated Diagnostics? How to reduce escalations and accelerate resolution with automation

Join PagerDuty’s Jake Cohen (Senior Product Manager) with RedMonk’s Kelly Fitzpatrick for a conversation and demo on automated diagnostics, process automation, and incident response. It’s all about automation helping first responders determine if there is an issue, which domain experts (if any) should be brought in to assist, and resolving the issue as quickly as possible.

PagerDuty and RedMonk Present: What is Automated Diagnostics? Part 1 - Use Case

Join PagerDuty’s Jake Cohen (Senior Product Manager) with RedMonk’s Kelly Fitzpatrick for a conversation and demo on automated diagnostics, process automation, and incident response. It’s all about automation helping first responders determine if there is an issue, which domain experts (if any) should be brought in to assist, and resolving the issue as quickly as possible. Part 1 of this 2-part video focuses on the concept and use case of automated diagnostics.

PagerDuty and RedMonk Present: What is Automated Diagnostics? Part 2 - Demo

Join PagerDuty’s Jake Cohen (Senior Product Manager) with RedMonk’s Kelly Fitzpatrick for a conversation and demo on automated diagnostics, process automation, and incident response. It’s all about automation helping first responders determine if there is an issue, which domain experts (if any) should be brought in to assist, and resolving the issue as quickly as possible. Part 2 of this 2-part video focuses on the demo of automated diagnostics.

How communication can make or break your incidents

In this episode, Pete and Lisa discuss why great communication (both internally and externally) is essential to the success of any incident management process. From keeping your wider team in the loop to minimise disruption, to using customer communication to strengthen your brand when things go wrong, the team share their experiences and top tips for having a transparent incident communication culture.