July 2023

Trending: Automation in I&O Optimization according to the Gartner 2023 Hype Cycle

In this blog, we take you through the latest trends in I&O optimization as Gartner’s report Hype Cycle for I&O Automation, 2023 predicts the widespread adoption of automated tools supporting IT infrastructure. This blog focuses on tools—like OnPage’s incident alert management solution—likely to be widely adopted as a standard for I&O optimization in the near future.

The Unplanned Show, Episode 7: Death of the Single Security Pane of Glass with Heather Hinton

In this episode, Heather Hinton describes how security teams can evolve away from spending cycles on “silly little jobs” and scouring multiple sources to identify the kinds of unplanned interrupt work that need to be dealt with urgently. Instead, they can complete projects faster and take on more because on-call rotations are spent getting work done (with the occasional interruption) instead of “seeking” the interrupt work. We also discuss how this fits in with encouraging employees more broadly to participate in security hygiene practices.

How to Maximize Time Savings and Reduce Toil During Incident Response

Incidents are a costly burden on businesses. Despite assembling the right people and teams, the manual work, tool setup and prolonged tasks can negatively impact customer experience. The need for adaptable processes to address diverse incident types further complicates the situation. This is where the PagerDuty Operations Cloud steps in. It streamlines and automates all the various manual steps in the incident response process.

Sponsored Post

Kubernetes Monitoring Best Practices

Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements. Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.

Failure Fridays at PagerDuty

Rich Lafferty, Staff SRE at PagerDuty, and Stevenson Jean-Pierre, Senior Manager, Software Engineering at PagerDuty, join Mandi Walls to talk about PagerDuty’s Failure Friday and Failure Any Day practices. PagerDuty has been using failure injection and chaos engineering methods to maintain the reliability of production services. Rich and SJP joined the PagerDuty live stream to talk about how the process works, how it has evolved, and how failure helps improve PagerDuty’s services.

The DevSecOps Toolchain: Vulnerability Scanning, Security as Code, DAST & More

DevSecOps is a philosophy that integrates security practices within the DevOps process. DevSecOps involves creating a ‘security as code’ culture with ongoing, flexible collaboration between release engineers and security teams. The main aim of DevSecOps is to make everyone accountable for security in the process of delivering high-quality, secure applications. This culture promotes shorter, more controlled iterations, making it easier to spot code defects and tackle security issues.

The Medium is the Message: How to Master the Most Essential Incident Communication Channels

We’ve all seen it: a company experiencing a major incident and going radio silent, leaving their customers to wonder “Are they doing something about this?!”. If you’ve ever been on the inside of something like this, you know the answer is most likely yes, there are people working hard to put out the fire as quickly as possible. But when it comes to incidents, perception is reality for customers.

Looking Beyond Atlassian StatusPage: The 5 Best Alternatives

Status Pages are crucial cogs in your Incident Communication process; they serve as vital channels to keep your stakeholders informed during periods of downtime. Although there are many proficient tools in the market, such as Atlassian Status Page and Status.io, these standalone Status Pages can come with a hefty price tag, with various pricing plans and tiers for both Public and Private Status Pages. Moreover, with Atlassian Cloud’s recent issues, its dependability is in question.

Custom fields: make FireHydrant your personalized incident management platform

Today we're releasing custom fields, a powerful new feature that empowers you to tailor FireHydrant to your organization's specific needs and capture essential incident details. Custom fields help you track critical states, involved parties, resolution specifics, affected services, messages, and more — almost anything you want! — all aligned with your unique workflows. Regardless of the size of your team or the maturity of your processes, custom fields adapt to your workflow.

AIOps and Dell's latest acquisition

Dell’s recent acquisition of Moogsoft is the latest validation of the growing market for automated ITOps, also known as AIOps. When legacy companies such as Dell recognize the importance of AIOps, it shows that the technologies behind automating ITOps are now mainstream and a vital part of every modern IT management stack. We look forward to seeing how Moogsoft’s integration into Dell will play out over the coming years.

In review: Gartner Hype Cycle for Monitoring and Observability

You know it’s going to be a great day when you find yourself mentioned as a sample vendor on the well-read Gartner’s Hype Cycle report. The OnPage team is thrilled to share with its community that we have been mentioned as a sample vendor by Gartner on their latest Hype Cycle for Monitoring and Observability. Continuing its impressive streak of mentions this year, OnPage is featured as a sample vendor, specifically within the Automated Incident Response category.

Latest Developments in Monitoring and Observability, 2023

You know it’s going to be a great day when you find yourself mentioned as a Sample Vendor on the Gartner® Hype Cycle™ report for Monitoring and Observability, 2023 (July 2023). The OnPage team is thrilled to share with its community that we have been mentioned as a Sample Vendor by Gartner on their latest Hype Cycle for Monitoring and Observability. OnPage is recognized as a Sample Vendor, specifically within the Automated Incident Response category.

210% ROI: unlocking the economic value of FireHydrant for incident management

In the fast-paced high-tech industry, efficient incident management is a critical factor in maintaining brand reputation, employee morale, and most importantly, your bottom line. Good practices can result in reduced downtime, increased learning opportunities from incidents, and an enhanced reputation among both the engineering community and customers. But quantifying the true cost of incidents has always been a challenge — until now.

Datadog and BigPanda: Observability and AIOps made better together

Datadog’s modern observability empowers development engineers with full-stack visibility, comprehensive instrumentation generation, and proactive alerts to accelerate software development releases and address potential incidents. While Datadog gives teams end-to-end visibility, it works even better together with AIOps from BigPanda – development teams gain insights into outside application dependencies and reliance on other systems.

10 Years of Failure Friday at PagerDuty: Fostering Resilience, Learning and Reliability

In today’s fast-paced and ever-evolving world of technology, failure is inevitable. Organizations should embrace failure as a learning opportunity for how to build and deliver more resilient services. At PagerDuty, we’ve practiced Failure Friday for 10 years now. Failure Friday–a practice inspired by the chaos engineering space–involves intentionally injecting failures into our systems to improve reliability and foster a proactive engineering culture.

The Unplanned Show, Episode 6: Defining AIOps with Heath Newburn

“AIOps” is a term some love to hate, but what makes it useful? In this episode, Heath Newburn breaks down the three things to look for in an AIOps solution: reduce noise, create context, and reduce toil. He also explains the challenges with domain-specific approaches, versus domain-agnostic approaches to AIOps. But even within that approach, Heath warns of “gotchas” in rules “tech debt”, data formats, and overall long implementation times.

We used GPT-4 during a hackathon - here's what we learned

We recently ran our first hackathon in quite some time. Over two days, our team collaborated in groups on various topics. By the end of it, we had 12 demos to share with the rest of the team. These ranged from improvements in debugging HTTP request responses to the delightful “automatic swag sharer.” Within our groups, a number of us tried integrating with OpenAI’s GPT to see what smarts we could bring to our product.

How summertime turns up the heat on cyber readiness (and what to do about it)

“Malicious cyber actors aren’t making the same holiday plans as you.” (CISA & FBI) Summertime is prime time for cyberattacks. According to one survey, 58% of security professionals believe that there is seasonality in the attacks that their company experiences every year, with the majority citing summer as high season for breaches.

In review: Gartner Hype Cycle for ITSM

The OnPage team is pleased to announce that we’ve been included in the latest Gartner® Hype Cycle for ITSM, 2023 report, which lists OnPage as a sample vendor in the Automated Incident Response category. For those unfamiliar with it, Gartner’s Hype Cycle for IT Service Management (ITSM) highlights tools and technologies that shape the ITSM ecosystem.

What's New in PagerDuty iOS and Android Mobile Applications

The PagerDuty Operations Cloud is your platform for action in critical moments. By harnessing the capabilities of AI and automation, it has the ability to detect and diagnose disruptive incidents, assemble the appropriate team members for prompt response, and optimize your digital operations by streamlining infrastructure and workflows.

Limitless Status Page Customization - Unlocked

Maintaining a comprehensive and engaging status page is the cornerstone of an effective incident communication strategy, yet too many companies limit themselves in this respect. Some rely on an assortment of disjointed application monitoring and manual incident notifications, while others look to the cheapest status page they can find.

Enhanced Incident Response: Maximizing Microsoft Teams with Squadcast

Of late, more and more businesses are relying on ChatOps tools like Microsoft Teams for a range of functions beyond simple communication. Incident management is no exception to this growing trend. However, Microsoft Teams alone may not possess all the necessary capabilities to efficiently perform these functions. To bridge this gap, integration with core applications becomes necessary.

5 Tips for Faster Troubleshooting to Reduce MTTR

In today’s rapidly evolving digital landscape, organizations heavily rely on their applications and systems to deliver optimal performance. As such, driving down the key metric of Mean Time to Resolution (MTTR) is clearly one of the biggest challenges facing observability practitioners today.

Gartner Market Guide: Embedding Automation Into the Enterprise

“Existing workload automation strategies are unable to cope with the expansion in complexity of workload types, volumes and locations driven by evolving business demand, as per Gartner. Digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. Cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.”

Incident Management Steps and Best Practices

According to the Uptime Institute’s 2022 Outage Analysis report, one out of every five companies has experienced a “serious” or “severe” incident over the past three years—a percentage that’s increasing. Those incidents are expensive: over 60% cost more than $100,000, while 15% set their companies back close to $1 million.

Align platform and product engineering teams over incidents

I firmly believe in never letting a good incident go to waste. Incidents expose weak spots and create opportunities for medium and long-term investments. In analyzing incidents and understanding their root causes, organizations can identify areas that require additional resources or enhancements. When incidents are used to align your platform and product engineering, it opens up opportunities to enhance the performance and security of your product.

Mastering Zero Trust - Pillars for Security

Zero Trust is a heightened security measure that blocks people and devices from accessing company data by default, only allowing access to those who prove they require it. Zero Trust assumes restricted access to company resources by all: Anyone or anything accessing company resources requires verification each time the system is accessed. There are no options to “trust this device next time” or “save password for next time”.

Managing Extreme Heat Events

A Q&A with Brian Toolan, Everbridge VP Global Public Safety. Talk about the trend in heat events that are impacting state and local governments. Each year, we witness the challenges cities and towns face due to extreme heat. Some of the biggest areas of concern are the places that haven’t experienced such extreme heat in the past or for prolonged periods of time.

Templates for Automating Incident Response

A security incident is the last thing any DevOps lead wants to see. Along with the vast number of protocols required to overcome an incident, there’s a hefty amount of paperwork to complete. Security incidents can even lead to legal repercussions if personal data is leaked. An incident response plan template drastically reduces the time and effort spent dealing with incident reports.

Unveiling Multibot, the "glue" for enterprise workflows

How are you delivering Slack incident management workflows that serve the many teams across your enterprise? How are you addressing the differences in their use cases, access needs, isolation needs, and tech stacks, all while enabling everyone to collaborate? These are challenging questions to answer. To effectively do so, you have a host of conditions to support at the team and company-wide levels.

Optimizing Resource Scheduling and Planning in Healthcare

The pandemic has exacerbated the staff shortage in healthcare, placing a disproportionate burden on the industry, and underscoring the significance of effective resource scheduling. While resource scheduling encompasses the allocation of healthcare staff and physical resources and assets, in this blog, our primary focus will be on healthcare staff. Resource scheduling plays a vital role in ensuring the smooth operation of healthcare facilities.

BigPanda-Cribl Integration: Stronger actionable insights within your observability data

The overwhelming volume and variety of observability data that most businesses encounter on a daily basis is impossible for IT operations teams to sift through manually. This can be a troubling reality when frequent high-value business data is required to consistently maintain the uptime and integrity of your services and applications.

July 2023 Update - New user management, Duty stand-ins, incident response in voice-calls and simplified SSO

Our July update includes new and optimized user management in the web portal and a new feature in the duty scheduler that makes it easy to create stand-ins for scheduled duty personnel. Furthermore, it is now possible to acknowledge or close Signls directly during the call. As always, all details can be found in this blog article.

How to communicate incidents using status pages

Status pages allow organizations to deliver real-time status updates on incidents and scheduled maintenance, which reduces the number of support tickets. It also brings transparency and reliability, thereby earning the trust of customers. Join our webinar to learn how Site24x7's StatusIQ is a great choice to communicate incidents to your end users and customers. In this webinar, we will answer all of your questions about status pages.

The Unplanned Show, Episode 5: DataOps with Snowflake

Long gone are the days when data is batch loaded into a data warehouse for business intelligence reports that are looked at periodically and if something is broken, a few internal people would have to wait. Today, data pipelines are “infinitely more complicated”, with more sources from cloud services to on premises systems, and supporting data applications that are critical parts of a business’ ecosystem.

Critical Incident Management - Roles and Responsibilities

Critical Incident Management is designed to handle disruptive and unexpected events that threaten to harm an organization or its stakeholders. These incidents range from cyber attacks and system failures to natural disasters and global pandemics. The importance of critical incident management cannot be overstated, as it is a pivotal process that maintains business continuity and ensures smooth operations despite adversities.

8 Tips to incorporate the voice of the customer in your story grooming/sprint planning

Creating successful products and projects goes beyond just great ideas and flexible processes. It's about truly understanding and listening to your customers. Attentively listening to their wants and needs unlocks invaluable insights that can revolutionize your story planning and project execution. In this blog, we'll look at easy but powerful tips to use the customer's input during story planning.

How we leverage our product responder role to push our pace of development

Like many of our own customers, at its heart, incident.io is a software company. This means that our work is never truly “done.” One of our primary goals is to help people coordinate their response to situations where things haven’t gone well, and make it easy to always do the right thing. But we know that there will always be bugs to fix, features to be introduced and improvements to be made, as evidenced by our changelog.

How Incident Tracking Can Benefit Your IT Organization

In the dynamic world of Information Technology (IT), incident tracking is a critical process within the realm of incident management that can significantly influence an organization’s operational efficiency and service quality. Incident management refers to the identification, recording, and management of incidents—unplanned events or disruptions—that can impact IT services.

How our engineering team uses Polish Parties to maintain quality at pace

It’s fair to say that delivering software faster has never been more relevant. But in doing so, it’s easy to let your bar for quality slip. Often, the guardrail to avoid this is to hire dedicated QA Engineers, whose sole job is to ensure your software works as it should and to spot any issues that arise. Seems sensible, right? Well, at incident.io, we take a different approach.

What Is Site Reliability Engineering? Understanding the complexities of this crucial function

Site reliability engineers manage a lot, and often in incredibly high-stakes environments. Remember that scene from "The Matrix" where Neo dodges bullets in slow motion? Of course you do. As an SRE, it can feel like you're the person getting hit by those bullets, frantically trying to investigate performance issues, automate away toil, and support the engineers around you, all before the next wave of attacks.

Share highly customizable Blameless Retrospectives as ServiceNow Problems

For many organizations, ServiceNow is a crucial platform to run and scale your organization across all departments. Many organizations’ engineering teams have been relying on ServiceNow Incident and Problem Management. Despite that, many have been experiencing a growing volume of incidents hindering their ability to scale not only their incident response but also their retrospective operations, potentially compromising their data governance and compliance requirements.

How we achieved pixel-perfect polish during our Status Pages launch

A few months ago, we released Status Pages. This project was quite different from anything we’ve approached before, and our goals were a departure from ones we had set in the past. With this in mind, we worked closely with our designer throughout the process of building Status Pages. Here is how we approached it and a few lessons we learned along the way!

Catalog vs. Thanos: Who came out on top?

Catalog is really, really powerful. To prove it, our latest product went up against the almighty Thanos and won decisively. Don’t believe us? Just look at how unscathed Catalog was once the dust settled. All jokes aside, we spent months building out what, we think, is one of the most capable products on the market today. Designed to be a map of everything that exists in your organization, Catalog can meaningfully help you level up your incident response.

Powering ConnectWise PSA With a New Alerting Workflow

In our previous blog from the ConnectWise series titled “OnPage-ConnectWise Incident Alert Management Workflows,” we discussed how customers are optimizing their investments in ConnectWise PSA. Now, we’re excited to present a new and powerful workflow specifically designed for after-hours that addresses the evolving needs of IT and Managed IT clients.

The Unplanned Show, Episode 4: Sriram Subramanian on Responsible Generative AI

Generative AI is a rapidly-evolving ecosystem with a lot of attention. In this episode, Dormain Drewitz asks Sriram Subramanian about the main challenges to responsibly implement generative AI, including content that’s harmful, inaccurate or violates privacy or security standards. Sriram discusses Microsoft’s 6 tenets to responsible generative AI, as well as the notion of shared responsibility between platform providers and foundational LLMs and the developers and data engineers building on top. Sriram also answers questions about where to get started safely with generative AI and shares his framework for identifying opportunities to add value.

Improve Visibility and Capture More Data with Triage Incidents

As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).

Understanding Chaos Engineering and its Benefits

In today's fast-paced technological landscape, ensuring the resilience and dependability of systems is crucial. This is where Chaos Engineering comes in, transforming how organizations approach system testing and fortification. Chaos Engineering helps find vulnerabilities that could go undetected under normal circumstances by purposefully introducing controlled interruptions and failures.

MTTR vs. MTBF vs. MTTF: Understanding Failure Metrics

In the dynamic landscape of software and web applications, failures can have severe consequences, impacting user experience, business continuity, and overall performance. To proactively address these challenges, organizations rely on robust monitoring practices supported by failure metrics. Failure metrics, specifically tailored to software and web application monitoring, provide crucial insights into system health, reliability, and optimization opportunities.
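To make the failure metrics above concrete, here is a minimal sketch in Python. It assumes the common definitions (MTTR as total downtime divided by the number of incidents, MTBF as total uptime in an observation window divided by the number of incidents) rather than anything specific to the linked article. The `Incident` record and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    """Hypothetical incident record used only for this illustration."""
    started: datetime   # when the failure began
    resolved: datetime  # when service was restored


def total_downtime(incidents: Sequence[Incident]) -> timedelta:
    """Sum of the durations of all incidents."""
    return sum((i.resolved - i.started for i in incidents), timedelta())


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to resolution: total downtime / number of incidents.

    Raises ZeroDivisionError if there are no incidents.
    """
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: Sequence[Incident], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Mean time between failures: uptime in the window / number of incidents."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Sample data (made up): a 30-day window with a 1-hour and a 3-hour outage.
start = datetime(2023, 7, 1)
end = start + timedelta(days=30)
incidents = [
    Incident(datetime(2023, 7, 5, 10), datetime(2023, 7, 5, 11)),
    Incident(datetime(2023, 7, 20, 2), datetime(2023, 7, 20, 5)),
]

print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, start, end))
print("Availability:", availability(mtbf(incidents, start, end), mttr(incidents)))
```

With this sample data, the window holds 720 hours with 4 hours of downtime, so MTTR comes to 2 hours and MTBF to 358 hours. MTTF is usually reserved for non-repairable components and is left out of the sketch.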

The Importance of Log Monitoring for Incident Response

In the face of growing security threats and incidents, businesses must prioritize their ability to detect, investigate, and respond effectively. Timely incident response is crucial for maintaining the security and integrity of systems and data. Among the essential tools in the incident response arsenal, log monitoring stands out as a critical component. By closely analyzing logs, organizations gain valuable insights into system events, user activities, and network traffic.

26 DevOps Automation Tools that SaaS Loves in 2023

DevOps is a term combining “development” and “operations”. It involves the use of tools and processes to minimize the time and effort spent on software creation and maintenance. Many DevOps technologies use automation to reduce manual tasks. These DevOps automation tools sometimes use AI-based technology to remove human-based operations, or simpler scripting and processing. This increases speed in feedback and performance between development and operations departments.

SIGNL4 Onboarding: Alert Notifications & Handling

The SIGNL4 Onboarding series walks users through the processes of SIGNL4 from Signup to Alerts to Settings. Today's video focuses on receiving alerts and all of the options available inside of your SIGNL4 alerts. This video is packed with helpful tips to help you get the most out of your account.

Unleash the true power of AIOps with BigPanda New Generative AI

IT response teams find themselves battling against an overwhelming onslaught of incidents. Frustratingly long response times, challenges with prioritization, and the relentless pursuit of root cause are formidable adversaries that test even the most skilled teams. I remember customers’ electrifying anticipation with AI and automation a decade ago. They hoped AI could be used to instantly decode the business impact of incidents and automation to respond to incidents without human intervention.

PagerDuty Extends Operations Cloud Leadership into AIOps and Automation

Forrester Names PagerDuty a Leader in first-ever Process-Centric AIOps Wave. From helping pioneer the DevOps movement to establishing best practices around service ownership to being the standard in incident response, PagerDuty has a long history of leadership. PagerDuty is honored to add to this list and now be recognized as a leader in the AIOps and Automation space by Forrester.

The differences between reactive vs proactive incident response

Most commonly, businesses take a reactive approach to incident management. After all, the concept of incident response seems inherently reactive. However, it is possible—and often necessary—to take more proactive measures. This entails identifying potential problems and taking steps to remediate them before they become incidents.

Effective incident escalations

In the ever-evolving digital landscape, every organization must confront its fair share of incidents. Regardless of the sector or size, one common thread weaves through them all: the need for effective incident management. A crucial part of this management is incident escalation, a topic on which we've had many discussions with various companies.

5 Takeaways from Gartner's Latest AIOps Analysis

If you’re still unpacking the latest terminology from Gartner’s 2023 AIOps market update, you aren’t alone. Subject matter experts from Moogsoft recently joined thought leaders from TIAA and Windward Consulting for a debrief on the panel interview Accelerating Your AIOps Journey Webinar. Almost half of technology leaders looking to improve productivity and fuel greater collaboration are struggling to explain AIOps use cases, benefits, and value to other business leaders.

Incident severity: why you need it and how to ensure it's set

Defined severity levels quickly get responders and stakeholders on the same page on the impact of the incident, and they set expectations for the level of response effort — both of which help you fix the problem faster. But sometimes, for whatever reason, a severity level just doesn’t get set. Maybe there’s confusion around what severity level to use. Or maybe you have a low barrier to declaration and your responders just need a little nudge.

Sponsored Post

Improve MTBF and MTTR for your Application Platforms by using MESH Observability

When businesses look at how best to understand the performance levels of their platforms, some of the best incident management metrics to look at are Mean Time Between Failures (MTBF) and Mean Time To Resolution (MTTR). These two measurements will give an excellent indication of the health and speed of the system, as well as the ability of the platform to take care of any anomalies that have been detected or to flag them up for others to take action to resolve them.

Carrier reduced MTTR and gained visibility across multiple IT environments

Hear Rich Johnston, Director of Hosting Platforms, describe Carrier’s observability goals to create a unified view of their IT environment for predictive monitoring. Rich describes Carrier’s desire to see issues before customer complaints, and how LogicMonitor implemented extensive visibility on a single platform, including multiple cloud platforms, networking, compute, storage, and more. LogicMonitor helped Carrier quickly and easily deploy dashboards to see how their technology performed, while reducing root cause analysis and shortening resolution time.

Tips on making on-call manageable

On-call responsibilities are a crucial part of many industries, ensuring that businesses can provide round-the-clock support to their customers. However, the demanding nature of on-call duty can lead to burnout and reduced productivity if not managed effectively. In this article, we will explore various strategies and tips to make on-call more manageable, enabling professionals to maintain a healthy work-life balance and deliver exceptional service.

The Incident Response Lifecycle: Strategies for Effective Incident Management

The nature of security and incident management is cyclical rather than linear. Resolving an issue doesn't mark the end of the team's responsibilities. Instead, it signals the opportunity to enhance reliability, strategize, prepare, and prevent similar problems. This is where incident response comes into the picture. But what is incident response, and what steps are included in the incident response lifecycle? Let's understand them in detail.

Docker Compose Logs: Guide & Best Practices

Docker Compose is a tool for defining and running multi-container Docker applications. It allows developers to streamline the process of configuring, building, and running multiple containers as a single unit with a docker-compose.yml. This configuration file specifies the services, networks, and volumes required for an application, and their relationships and dependencies. The docker-compose logs command displays the logs of all services defined in the docker-compose.yml file.
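As a rough illustration of the `docker-compose logs` command described above, the Python sketch below builds a command line for it. The helper names and the service name `web` are hypothetical. The options used (`--follow`, `--timestamps`, `--tail`, and the top-level `-f` for choosing a compose file) are standard Compose options, though it is worth checking them against the Compose version you have installed.

```python
import subprocess
from typing import List, Optional, Sequence


def compose_logs_command(services: Sequence[str] = (),
                         tail: Optional[int] = None,
                         follow: bool = False,
                         timestamps: bool = False,
                         compose_file: Optional[str] = None) -> List[str]:
    """Build the argument list for `docker-compose logs` (hypothetical helper)."""
    cmd = ["docker-compose"]
    if compose_file:
        # Top-level -f selects a compose file other than the default.
        cmd += ["-f", compose_file]
    cmd.append("logs")
    if follow:
        cmd.append("--follow")        # keep streaming new log lines
    if timestamps:
        cmd.append("--timestamps")    # prefix each line with a timestamp
    if tail is not None:
        cmd.append(f"--tail={tail}")  # only the last N lines per container
    # With no service names, logs for all services in the file are shown.
    cmd += list(services)
    return cmd


def show_logs(**kwargs) -> None:
    """Run the command; requires Docker Compose to be installed."""
    subprocess.run(compose_logs_command(**kwargs), check=True)


# Print the command for the last 100 timestamped lines of a service named "web".
print(" ".join(compose_logs_command(["web"], tail=100, timestamps=True)))
```

Building the argument list separately from running it keeps the sketch easy to inspect without a running Docker daemon.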

How Schneider Electric reduced MTTI and alert noise by consolidating monitoring tools

Hear Observability and Monitoring Strategist, Arun Mandayam, describe challenges that Schneider Electric faced around data interpretation and difficulties when using multiple monitoring tools. Arun describes how LogicMonitor helped consolidate monitoring tools, enabled them to onboard new cloud accounts, network devices, and on-prem systems on a unified platform, and helped significantly reduce MTTI and alert noise.