SaaS is exploding, and so it should: it takes commoditized work and infrastructure away from tech teams so that they can focus on differentiating features. But what happens when it goes wrong? How do SaaS platforms make sure they aren't letting their customers down and, in turn, letting their customers' customers down? Observability, bolstered with AI, gives all the partners the best chance to optimize availability and customer experience. Here's how.
Although IT incidents have always been a concern, the increase in customer-facing technology adds the cost of a bad customer experience to the cost of responding to and remediating an incident. While in a perfect world, you’d be able to prevent incidents from happening in the first place, the reality is they do happen and more often than most of us would like to admit.
Across the globe, in-person technology events are beginning to emerge from their pandemic hibernation. For developers and DevOps teams, no event has been more anticipated than AWS re:Invent, which is back in Las Vegas, November 29th — December 3rd to help bring us all back together and slowly let us find our new normal. While handshakes may be replaced by elbow bumps or other newfound greeting rituals, we are excited to be back and see all of you in real life.
No matter how much you try to avoid them, incidents are bound to happen. And while your first instinct is to resolve the issue, it shouldn’t be your only priority. By solely focusing on solving the problem and not communicating it to affected stakeholders, like team members and customers, you’re actively making the situation worse. In this article, we’ll discuss what incident communication is and how to create a strong incident communication plan.
At DevWeek Austin, we discussed how AI and ML have come to the DevOps toolchain and are a great fit! Here are the 3 main takeaways.
As announced at the User Group Meeting 2021, we are now releasing Enterprise Alert 9.1. This version brings a set of new features extending the capabilities in some crucial areas. Here is what’s new in a nutshell: As always, you will find more details, release notes and downloadable installer files in the online user group. You can also watch the session from our UGM on YouTube.
Six tips on how Site Reliability Engineers (SREs) can prepare for the reliability challenges of Black Friday and Cyber Monday 2021
When an incident strikes, an organization’s reputation and revenue, as well as customer trust, are at stake. Assembling an effective incident response team is critical to minimizing the incident’s impact. But what exactly is an incident response team? Who should be a part of the team and what are their responsibilities? Successful incident responses require a team with a diverse set of problem-solving and communication skills.
The AIOps market is flourishing and the new year is coming up, so let’s take a look at the top 3 trends to watch out for in 2022.
Burnout from work is proven to have a tangible impact on your physical health and happiness. Learn how to recognize burnout in yourself and your employees, and build a happy developer culture!
This article should give you a first idea of what SIGNL4 does. What do IT security, production monitoring and technical field service have in common? In all these scenarios, the right people need to get notified immediately – in case of technical malfunctions, urgent maintenance orders or emergencies, all in order to solve any incident quickly and efficiently.
With Halloween behind us and the holiday shopping season fast approaching, engineering and product teams know what that means: code freezes! At xMatters, code freezes are a part of our product release process in anticipation of the busiest — and most important — time of the year for many of our customers. But code freezes are just one piece of the puzzle in how we ensure our customers have the most reliable experiences. The way our product releases are designed is much more than that.
At incident.io, we're acutely aware that we handle incredibly sensitive data on behalf of our customers. Moving fast and breaking things is all well and good, but keeping our customer data safe isn't something we can compromise on. We run incident.io as a multi-tenant application, which means we have a single database (and a single application).
In today’s digital economy, seconds matter. For mission-driven organizations, seconds can be a matter of life and death, and service reliability can make or break access to suicide and safety hotlines, disaster relief, time-critical health care, food assistance, and more. That’s where real-time digital operations comes in.
A history of Site Reliability Engineering from its origins at Google in 2003 to the present.
Fast build times are great, which is why we aim for less than 5 minutes between merging a PR and getting it into production. Not only is waiting on builds a waste of developer time (and an annoying concentration breaker), but the speed at which you can deploy new changes also has an impact on your shipping velocity. Put simply, you can ship faster and with more confidence when deploying a follow-up fix is a simple, quick change.
Complex incidents are both exhausting and commonplace. By “complex,” I mean incidents that involve multiple, disparate notifications in your alert management platform. Perhaps these incidents are logically separated because the underlying systems or services were seen as less coupled than they turned out to be in reality.
In this article, we’re exploring how status pages can help you deliver bad news to customers in a “good way,” starting with the psychology of news delivery and how you can use this knowledge for future incidents.
Deck the halls! It's time for the annual holiday Code Freeze, that festive time of year when businesses impose a precautionary halt to code changes and Operations should be quiet. But before you kick up your feet, make sure that demand doesn’t lead to availability embarrassments. After all, retail experts suggest that we’re in for another online-heavy holiday shopping season, so businesses need to brace for increased digital traffic...with little tolerance for failure.
Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.
Modern businesses are digital businesses—so managing your business means mastering your critical services and operations for your employees and customers. Today, you need to be able to understand every aspect of your company—as it unfolds—because in this world, seconds matter to your productivity, your revenue, and most importantly, your customers.
The world is changing, and with great change comes an evolving threat landscape. Increases in physical and digital disruption, such as civil unrest, cyberattacks, severe weather events, and unplanned outages, have left many industries, including the cellular industry, scrambling to secure a robust operational resilience strategy. This evolving landscape poses a unique risk to cellular carriers, whose business is growing at a breakneck pace.
BASF is the largest chemical producer in the world with a revenue of EUR 59bn, 247 manufacturing sites and 110,000 employees. BASF’s Coatings division employs 11,000 people and develops, produces and markets innovative solutions for automotive OEM and automotive refinish coatings and industrial coatings as well as architectural coatings and related coating processes.
As infrastructure stacks grow increasingly complex and involve an ever-growing number of services, system failures are becoming more and more common. There can be a variety of reasons why systems fail: software bugs, misconfiguration, interactions between services that cause unexpected behavior, network outages, and, of course, those rare occasions where natural events render data centers inoperative.
“Service outage! Help!” These words (or their variations) have preceded notable losses of millions and billions of dollars in the 21st century. From large corporations to SMBs, no one is immune to the effects of downtime, whether planned or unplanned. However, the earlier an issue is noticed, the faster it is acted upon and resolved, resulting in little or no customer impact.
Follow these steps to write a great SRE job resume.
IT organizations are challenged with delivering quick, effective resolution to customers’ database, hardware or software downtime issues. Contractually binding service-level agreements (SLAs) place further pressure on IT engineers to accelerate incident resolution time and minimize downtime. Though engineers are obligated to meet their SLAs, they are unable to do so without the help of an automated alerting system.
We're a small team of engineers right now, but each engineer has experience working at companies that invested heavily in observability. While we can't afford months of time dedicated to our tooling, we want to come as close as possible to what we know is good, while running as little as we can: ideally buying, not building. Even with these constraints, we've been surprised at just how good we've managed to get our setup.
As the holiday season aggressively approaches, I want to make a public service announcement for everyone toying with the idea of a code freeze for the holidays: please don't. It’s getting cold outside and the season of peppermint mochas is upon us, which might get you thinking about putting a code freeze in place for the holidays. A word of warning: instituting a code freeze may have unintended consequences.
“Thanks to Enterprise Alert and the acknowledgement function, we can track the alerting and response digitally and have the certainty that our employees always take care of incidents in our critical IT infrastructure in a timely manner. IT alerting with Derdack, which has to be documented according to BaFin KRITIS, is highly reliable,” says Markus Reusch, Product Owner Monitoring, Debeka.
While service incidents can be wildly dissimilar, they tend to have one thing in common: a need for quick resolution. Response teams need a robust, repeatable process to follow that ensures fast, mistake-free execution, especially for those 4 AM calls. Having a documented checklist saved where the entire team can access and use it at any time could make the difference between resolving the problem quickly and compounding it.
The concept and development of DevOps have significantly changed the way IT teams work in the last decade. Small and large teams alike can see the difference when they switch from traditional software development cycles to a DevOps cycle: accelerated innovation, improved collaboration, faster time to market. And the list of benefits continues to grow. Embracing DevOps effectively, however, is not an easy task. Thankfully, there are ways to navigate this challenging journey.
A critical part of managing modern software development is setting up and running an on-call rotation. But that often involves significant toil, in part because many of the existing tools are cumbersome and not developer-friendly. That’s why we’re excited to announce Grafana OnCall, an easy-to-use on-call management tool that will help reduce toil in on-call management through simpler workflows and interfaces tailored for devs.
At incident.io, we ship fast. We're talking multiple times a day, every day (yes, including Fridays). Once I merge a pull request (PR), my changes rocket their way into production without me lifting a finger. 💅 It's when we tackle larger projects that this becomes a bit more complicated. We recently launched Announcement Rules, which let you configure which channels incident announcements are posted in depending on criteria you define.
The world is moving fast, led by an ever-accelerating IT landscape. In recent years, two distinct types of teams have emerged that assist in driving this business transformation: DevOps/SRE teams that are in charge of driving rapid innovation of products and services, and IT Ops/NOC teams that focus on preventing outages and maintaining the high level of quality, reliability and serviceability that modern, discerning customers expect.
We all know one bad experience can impact a customer’s perception of—and even willingness to deal with—an organization going forward. That’s why so many companies, in virtually every industry, have made investing in customer experience (CX) a top priority, according to ResearchAndMarkets.com. The problem is, for any given organization, there are a number of customer service processes along the entire life span of an interaction that need to be looked at and made great.
In this digital era, technology systems are becoming increasingly complex. No longer can a single SME (subject matter expert) understand every facet of the system they run. Instead, much of this knowledge is siloed and exists as tribal knowledge within certain teams. Additionally, the rate of change is faster than ever, with code deploying and new services shipping at a rate unimaginable a few years ago.
An explanation of the meaning of SLA, SLO and SLI, and how SREs should use each concept to manage reliability.
Cloudflare is a cloud services provider with a presence all over the globe, from San Francisco, US, to London, England, to Sydney, Australia. Their mission, as stated front and center on their homepage, is to help build a better Internet. While that may read like hyperbole, their numbers are impressive: Cloudflare has over 126,000 paying customers, and 95% of Internet users in the developed world are within 50ms of their network.
xMatters is a crucial tool for DevOps teams, and no one knows that better than our customers. Over the years we’ve published countless DevOps case studies, but when it comes to the test of time, some have stood up and have continued to make an impact.
WALTHAM, Mass., Nov. 3, 2021 — OnPage Corporation, a Boston-based incident management company, today announced the availability of new integrations with leading single sign-on (SSO) solutions Okta and OneLogin. The latest integrations allow for a secure authentication process when users log in to the OnPage system using their SSO account credentials.
Our November update introduces new team settings and, along with them, entirely new options for escalating Signls. This will allow you to make your incident response even more reliable. One application is to create a ‘managers on duty’ team with full duty scheduling capabilities and escalate missed Signls to this second-level response team. As always, you can find all the details in this article.
Across the globe, both public and private sectors are more concerned than ever about addressing climate change and its associated risks. “In the period 2000 to 2019, there were 7,348 major recorded disaster events claiming 1.23 million lives, affecting 4.2 billion people (many on more than one occasion) resulting in approximately US$2.97 trillion in global economic losses,” according to a report by the UN Office for Disaster Risk Reduction (UNDRR).
Get ready for something exciting coming your way! xMatters' latest release, Ninja, is on the horizon and will be available in production next week. Named in honor of the classic video game Ninja Gaiden, this latest batch of xMatters updates is sure to pack a punch (pun definitely intended). This release rolls out exciting new features like an intelligent Service Dependencies map and integrations with the broader Everbridge platform, among many other things.