April 2023

Sponsored Post

Scaling Site Reliability Engineering Teams the Right Way

Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about, what the indicators are, what steps you can take, and how you know when you're done.

Measuring organizational resilience: tools, techniques, and best practices

It is no surprise that resilience has become a frequently identified trait for success. McKinsey stated, “To thrive in the coming decade, companies must develop resilience—the ability to withstand unpredictable threat or change and then to emerge stronger.” However, how can organizations measure their resilience in the first place? Strengthening resilience requires organizations to take a step back and assess how they measure up to their competitors and what processes need the most attention.

Forgot to declare an incident? Add it retroactively in FireHydrant.

Have you ever quickly worked through an issue with your team and later thought, “Huh. That probably should have been an incident”? It happened to us just a few weeks back. After one of our engineers surfaced a failed build, a few folks chimed in to problem-solve, and within 30 minutes things were up and running like normal. But we probably should have declared an incident.

New Features: Next-Generation Notifications UI, Take-On Call Widget, Alert Templates, Dynamic Policy Routing, Service Groups

This post highlights some of the features and improvements that we have released in the last two months. If you want to submit your own ideas or vote on existing feature requests, you can now use our public roadmap at roadmap.ilert.com.

SIGNL4 Onboarding: Scheduling - Creation & Options

The SIGNL4 Onboarding series walks users through the processes of SIGNL4 from Signup to Alerts to Settings. Today's video focuses on scheduling users for duty shifts. Learn how to schedule users for SIGNL4 shifts, what the scheduling options are, and how they affect your team and schedule. Learn how to create a schedule and then copy it so you only have to create it once. This video is packed with helpful tips to help you get the most out of your account.

How to get started with BigPanda Incident Intelligence and Automation powered by AIOps

If you’re in IT operations or manage NOC, SRE, and DevOps teams, chances are your IT environment is growing too complex for you and your teams to manage. Any enterprise, large or small, around the globe, is continuously changing its IT stack due to evolving business requirements and significant industry trends. But digital transformation, hybrid infrastructure, DevOps adoption, and continuous integration and continuous delivery (CI/CD) pipelines are all causing major headaches.

Velocity vs. Cycle Time: Which Metric is Right for Your Team?

In the world of agile development, tracking the progress of work is a critical aspect of the development process. Velocity is a metric that is often used to measure how much work a team can complete in a given period. Velocity is a measurement of the average number of story points (or another unit of work) completed by the team in a sprint. The idea is to track the velocity over time to help the team plan how much work they can realistically complete in a sprint.
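As a rough illustration of the definition above, velocity can be computed as the mean of story points completed per sprint. This is only a sketch: the sprint totals are made-up numbers, and `average_velocity` and its `window` parameter are hypothetical names, not taken from any particular tool.

```python
from statistics import mean


def average_velocity(completed_points, window=None):
    """Mean story points completed per sprint.

    completed_points: points completed in each sprint, oldest first.
    window: if given, average only the most recent `window` sprints.
    """
    if not completed_points:
        raise ValueError("need at least one completed sprint")
    recent = completed_points[-window:] if window else completed_points
    return mean(recent)


# Hypothetical history: points completed in the last five sprints.
history = [21, 18, 25, 20, 26]
overall = average_velocity(history)        # average over all five sprints
last_three = average_velocity(history, 3)  # average over the latest three
print(overall, last_three)
```

Averaging over a recent window rather than the full history is one possible choice when a team's capacity has changed; which window to use is a judgment call for the team.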

The Dangers of Alert Fatigue: Strategies for Effective Alert Management

Alert fatigue is a serious issue that affects numerous professions, especially in the IT industry. It can lead to neglecting critical events and delaying response times. IT teams need to continuously monitor their systems and applications to avert possible downtime and keep operations running smoothly. However, a high number of incoming alerts inundating these teams can make them less responsive. The ramifications of such disregard can severely affect the efficiency and dependability of IT teams.

User story: How a global media company reduced costly outages by implementing a secure DevSecOps collaboration platform

Catastrophic failures — such as a security breach or a complete outage leading to an unavailable product or service — are classified as Sev0 incidents. On a severity scale of 1–3, Sev0 is dire. It brings business to a complete standstill and may lead to loss of revenue and a damaged reputation. A Sev0 incident usually has no quick workaround; it requires a coordinated effort beyond the engineering team to diagnose, correct, and manage.

Welcome To xMatters - Ep 1 - Connecting Your Tools

When help is needed, xMatters ensures the right message reaches the right people at the right time. Our service reliability platform gives teams the superpowers to choose from hundreds of free downloadable workflows, connect their favorite tools, and level up their incident response process so issues are fixed before they can impact customers.

Should Every Incident Get a Retro?

At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague Ben Hartshorne asked a fascinating question, which I’ll paraphrase here: That caught me by surprise. We had a great discussion, and it made me consider approaches I hadn’t before.

9 incident management solutions to improve your workflows

Incident management is a team effort. While it's true that incident management should be seen as a company-wide effort, and you should empower all teams to declare incidents, this differs from the team effort I'm referring to here. No, incident management is a team effort in the sense that no one tool can do it all, not even incident.io. We covered as much when we discussed why we integrate with tools that can be seen as our competitors – and that’s OK!

8 Best IT Monitoring Tools and Software of 2023 (Updated)

Monitoring tools, also known as observability solutions, are designed to track the status of critical IT applications, networks, infrastructures, websites and more. The best IT monitoring tools quickly detect problems in resources and alert the right respondents to resolve critical issues. Response teams use observability solutions to gain real-time insights into resource availability, stability and performance.

Install Prometheus on Kubernetes: Tutorial & Examples

As one of the most popular open-source Kubernetes monitoring solutions, Prometheus leverages a multidimensional data model of time-stamped metric data and labels. The platform uses a pull-based architecture to collect metrics from various targets. It stores the metrics in a time-series database and provides the powerful PromQL query language for efficient analysis and data visualization.
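To make the ideas of a labeled, time-stamped data model and pull-based collection more concrete, here is a toy sketch in plain Python. It is not Prometheus code and does not use any Prometheus library: `ToyTSDB`, `scrape`, the target names, and the sample values are all assumptions invented for illustration.

```python
import time
from collections import defaultdict


class ToyTSDB:
    """In-memory stand-in for a time-series database (not Prometheus itself)."""

    def __init__(self):
        # Each series is identified by a metric name plus its sorted label pairs.
        self.series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        key = (name, tuple(sorted(labels.items())))
        self.series[key].append((timestamp, value))

    def select(self, name, **matchers):
        """Return the latest value of every series matching name and labels."""
        result = {}
        for (metric, labels), samples in self.series.items():
            label_map = dict(labels)
            if metric == name and all(
                label_map.get(k) == v for k, v in matchers.items()
            ):
                result[labels] = samples[-1][1]
        return result


def scrape(targets, db, now=None):
    """Pull model: the collector calls each target; targets never push."""
    now = time.time() if now is None else now
    for instance, read_metrics in targets.items():
        for name, labels, value in read_metrics():
            db.append(name, {**labels, "instance": instance}, now, value)


# Hypothetical targets: callables standing in for HTTP metrics endpoints.
targets = {
    "web-1": lambda: [
        ("http_requests_total", {"code": "200"}, 120.0),
        ("http_requests_total", {"code": "500"}, 3.0),
    ],
    "web-2": lambda: [("http_requests_total", {"code": "200"}, 95.0)],
}

db = ToyTSDB()
scrape(targets, db, now=1000.0)
ok_requests = db.select("http_requests_total", code="200")
total_ok = sum(ok_requests.values())
print(total_ok)
```

The `select` call is loosely analogous to a PromQL selector such as `http_requests_total{code="200"}`: the metric name and label matchers together pick out a set of series.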

Easier, Leaner, and a more reliable Status Page

Our status page product started last year as an experiment. We built a status page product in a hurry over weekends, and to our surprise, it gained a lot of traction. People were using it and giving us feedback, which helped us improve the product over time. And this year, we're thrilled to announce that we have great things planned for our status page product! The newly revamped dashboard is part of a larger plan for our status page product. Here's a quick gist of the multiple releases.

Battling database performance

Earlier this year, we experienced intermittent timeouts in our application while interacting with our database over a period of two weeks. Despite our best efforts, we couldn’t immediately identify a clear cause; there were no code changes that significantly altered our database usage, no sudden changes in traffic, and nothing alarming in our logs, traces, or dashboards. During that two-week period, we deployed 24 different performance and observability-focused changes to address the problem.

Top 3 Incident Response Problems AIOps Can Help Your Teams Solve

More data for data’s sake doesn’t help anyone. What organizations need is more information–actionable insight. With data coming from incoming streams of events and alerts, teams don’t have enough time to look at each one. And they struggle to parse and consolidate this data in order to figure out what they need to do next to resolve an incident.

The seven key resilience findings of the most resilient EMEA organizations

Resilience is more than just a goal that organizations strive to achieve. With an increased number of critical events, including cyber-attacks, extreme weather and violent crime, resilience is vital for the short-term and long-term success of any operation. Everbridge and Atos set out to find the links between resilience and success, with a report from Dr. Stefan Vieweg, Director of the Institute for Compliance and Corporate Governance (ICC) at the Rheinische Fachhochschule in Cologne, Germany.

How we built it: incident.io Status Pages

We kicked off 2023 with a new team and a new product to build - Status Pages. We wanted to build a solution we could ship to customers as quickly as possible, while making sure that it’s reliable, fast and beautiful. Here’s how that process played out over the course of three months.

Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration

PagerDuty’s Global Event Orchestration is now generally available. Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities.
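The general idea of a decision engine that enriches events, routes them, and triggers automated actions can be sketched with a small rule list. This is a generic illustration only: it is not PagerDuty's API or rule syntax, and `Rule`, `orchestrate`, the service names, and the automation names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    matches: Callable[[dict], bool]               # condition on event data
    annotate: dict = field(default_factory=dict)  # fields added to the event
    route_to: Optional[str] = None                # service that receives it
    automation: Optional[str] = None              # remediation job to run


def orchestrate(event, rules, default_service="catch-all"):
    """Apply the first matching rule; return (event, service, automation)."""
    for rule in rules:
        if rule.matches(event):
            enriched = {**event, **rule.annotate}
            return enriched, rule.route_to or default_service, rule.automation
    # No rule matched: pass the event through unchanged to the default service.
    return dict(event), default_service, None


# Hypothetical rules, evaluated in order.
rules = [
    Rule(
        matches=lambda e: e.get("component") == "db"
        and e.get("severity") == "critical",
        annotate={"runbook": "db-failover"},
        route_to="database-oncall",
        automation="restart-replica",
    ),
    Rule(matches=lambda e: e.get("severity") == "info", route_to="low-priority"),
]

event = {"component": "db", "severity": "critical", "summary": "replica lag"}
enriched, service, automation = orchestrate(event, rules)
print(service, automation, enriched["runbook"])
```

First-match-wins evaluation is just one design option chosen here for simplicity; a real engine may nest rules or apply several in sequence.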

The rise of ServiceOps: unifying IT service delivery

With the complex and steadfast growth of IT service delivery processes, organizations and their internal teams have come to rely on several tools in their toolbox to deliver best in class products and services. The use of AIOps, AI/ML, and overall automation has shaped modern delivery methods, but what we call this process, and how we grow to advance it, has yet to find a definition that’s universally recognized.

Teamwork Without Borders: How to Create a Strong Team Culture Across Time Zones

Working across different time zones can present significant challenges when it comes to fostering a team culture. I came across a typical scenario in a geographically distributed team with their Engineering team members based in New York and Poland. They are set to welcome a new Director of Engineering based on the West Coast. With minimal daily overlap between the teams, the question arose about how to create and manage their team culture.

Announcing incident.io Status Pages - powering clear external comms to build trust

Clear and frequent communication carries considerable weight in today's era of hyper-competition among businesses—especially during incidents. Because of this, status pages have become the go-to choice for companies looking to prioritize trust, transparency, and clarity with their customers, even during downtime. Unfortunately, current status page solutions have made these communications particularly frustrating and stressful.

IT Incidents vs. Alerts

IT incidents are events that lead to a disruption or deviation from the regular operating standards of a computer system or network. They can be caused by various factors, including hardware or software failures, human error, or even deliberate external (cybersecurity) attacks. They begin with short delays or services cutting out, for example when a website or server is down, or when access to data or databases takes too long.

Incident Response Guide

Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage.

Automated Incident Management

Automated Incident Management is the process of automating some or all of these tasks through various means. Automated incident management can improve incident response time and reduce unnecessary work, such as when an issue has minimal impact. AlertOps can help automate incident management by creating tickets in help desk systems, applying filtering and rules, and escalating alerts.

Alarm Notification Software: SIGNL4 is test winner

The renowned German manufacturing magazine “Factory Innovation” recently conducted a comprehensive practical test of four leading alarm notification software products for industrial manufacturing in its latest issue (01/23). The four alarming systems that were evaluated are: the Alarm Control Center from Alarm IT Factory (a spin-off of Siemens AG), ALERT 4.0 from Micromedia, the Alarm and Information Portal (AIP) from VIDEC, and SIGNL4 from Derdack.

Our A, B, Cs of external communications

Communication carries more weight than ever before. Businesses are so much more connected to their customers given the number of mediums they can communicate through: Twitter, Instagram, Facebook, and even TikTok. Because of this, it's essential to prioritize these lines of communication throughout your day-to-day. Some might even say that over-communicating is the best way forward. Why? No one likes a company that appears to be a black box with zero insight into what's happening.

Time to Resolution: What is it, Why You Need it, And How to Calculate it

Ready, set, go: when it comes to customer service, it's a race against the clock. Customers expect lightning-fast responses and complete solutions to their problems. But what happens when your help desk can't keep up with the pace? The answer is simple: frustration, dissatisfaction, and potentially lost clients. That's why measuring and improving Time to Resolution (TTR) is crucial. As a customer, there's nothing more irritating than dealing with a slow or ineffective help desk.

How to prepare for, deal with, and recover from IT outages

The average cost of an IT outage is $12,900—per minute. And when it comes to a “significant outage,” organizations reported the average overall cost was a whopping $1,477,800. On the latest podcast episode of That’s great IT, I spoke with Scott Lee, AVP for infrastructure and ITOps at Arch Mortgage Insurance Company, part of Arch Capital Group, about how organizations can best navigate IT outages.

Global Event Orchestrations Demo

Frank Emery, Principal Product Manager, joins the Twitch stream to talk about and show off enhancements to Event Orchestration, featuring the new Global Event Orchestrations feature. Global orchestration rules will enable your organization to suppress, annotate, and customize events for all services in your PagerDuty account. This new feature is available to all accounts with AIOps plans.

Transforming Incident Management with KPIs: A Comprehensive Guide

In modern times, the significance of digital experiences cannot be overstated across various industries. Thus, a well-designed and effective incident management system is essential to ensure the smooth running of businesses and prevent any revenue loss. The ability to respond and resolve incidents promptly enhances the dependability and trustworthiness of businesses in the eyes of their users. Conversely, failure to handle incidents efficiently can lead to negative consequences.

Admin Panel - Custom User Properties - xMatters Support

You can use custom user properties to store additional information about people in your organization. You can use this information to sort, find, and organize users, as well as to notify teams based on particular criteria, like a specific skill set. Custom user properties are configured in the Admin or Settings menu and appear as optional or required fields in each user's profile.

The why and how behind running incident response game days

In any high-pressure situation, the key to fast action is preparedness. And that’s true when it comes to incidents, too. Documenting and training your team on your incident response processes is essential to ensuring a coordinated and efficient response effort. And training sessions, or game days, as they’re sometimes called, are one way to get everyone up to speed.

Introducing PagerDuty AIOps: Harnessing the Power of AI to Transform Modern Operations for the Enterprise

Today, PagerDuty launched a new AIOps solution to leverage the power of AI, provide built-in automation and build on the company’s foundational data model to transform modern operations for the enterprise. PagerDuty has long suppressed noise to help distributed development teams focus.

Is your incident management solution creating more problems than it solves?

When it comes to incident response, the ability to adapt and customize your approach is key. Every organization has unique needs and workflows, and a one-size-fits-all solution simply won't cut it. That's why Blameless is proud to offer a flexible platform that allows teams to tailor their incident response process to fit their exact requirements.

Building a culture of incident response

At Vanta, our goal is to nurture a positive security culture in everything we do—which is especially critical given that helping our customers improve their security and compliance posture starts with our own. Employees are the key to our security resilience, so we strive to build and support a strong culture of incident response in tandem. Here’s what that means to us at Vanta.

Four Years as a Public Company

Four years ago tomorrow, our team rang the bell to open the NYSE for PagerDuty’s IPO. We spent two weeks traveling to meet hundreds of prospective investors in person, sustained by a diet of Cheetos and green M&Ms, sneaker-clad walks to meetings, and unwinding with bad karaoke. We’ve grown in many ways in our first four years as a public company. We have more than doubled the number of customers on the PagerDuty platform, and nearly tripled the number of users.

How to enrich IT alerts and add context with Data Engineering

I see it daily in my role: IT organizations are paying for best-of-breed monitoring tools but struggle to tie the pieces together between these siloed systems. The pain of these silos deepens when incidents arise. Incidents are costly for so many reasons, like wasted company resources, potential revenue loss, reduced customer satisfaction, employee burnout, etc. This is exactly why BigPanda exists: to apply AI to the complex problems IT operations, NOC, SRE, and DevOps teams face daily.

Incident Response Playbook

In today's digital age, IT departments play a crucial role in maintaining the overall functionality and security of an organization. One essential tool for managing service outages and downtime is the incident response playbook. This comprehensive guide provides IT departments with the necessary processes and strategies to resolve incidents in a timely and efficient manner.

Time to Upgrade? Why Traditional Pagers Are No Longer Enough

When it comes to time-sensitive events, instant, reliable communication is key. In the past, pagers were relied on for quick communications as they allowed people to communicate on the go and without access to a landline. But today, the availability of cellphones has made the portability of communication devices a standard feature, and communication technology has advanced significantly, raising the question: what is the use for pagers today?

Create a service catalog that grows with you

When your incident response process is centered around a service catalog, responders are able to more quickly pinpoint the service or functionality that’s down, bring in the team or experts, and then get to solving the problem faster. Saving even a few minutes can have a big impact on decreasing the costs around incidents and outages, so having up-to-date service details at your fingertips can make all the difference.

Squadcast + HaloPSA Integration: Enabling Streamlined Incident Response & Alerting

HaloPSA is a modern and intuitive all-in-one professional services automation (PSA) solution, designed for service providers. HaloPSA’s cloud platform helps you manage your entire business, modernize customer experience and automate your service. If you use HaloPSA for PSA requirements, you can integrate it with Squadcast, an end-to-end Incident Response and Reliability Workflow platform, to route detailed alerts from HaloPSA to the right users in Squadcast.

Developer environments should be cattle, not pets

Cattle, not pets is a DevOps phrase referring to servers that are disposable and automatically replaced (cattle) as opposed to indispensable and manually managed (pets). Local development environments should be treated the same way, and your tooling should make that as easy as possible. Here, I’ll walk through an example from one of my first projects at incident.io, where I reset my local environment a few times to keep us moving quickly.

Admin Panel - General Settings - xMatters Support

You can define the details for a company using the General Settings page accessed via the Admin menu. Depending on your permission level, you may not be able to view the General Settings screen. In addition, the settings you see on this page depend on both your role permissions and the features available in your product plan.