2021 has been an eye-opening year for both businesses and consumers who use popular websites and applications. We have all seen notable increases in the frequency and severity of outages as dependency on internet infrastructure grows – with no signs of slowing down. With our reliance on automation and connectivity expected to increase in 2022 – let’s review some of the top internet outages and website downtime incidents of 2021.
With only a few days left of 2021, we all know what that means: making New Year’s resolutions. While some love the tradition of laying out their goals for the coming 12 months, others loathe it with a passion. And with approximately 80% of people failing to achieve their resolutions, it’s easy to see why there’s so much resentment towards this common habit. At xMatters, we plan to—and often do—beat those odds.
A couple of weeks ago, our partner Rok Ponikvar from S&T contacted me about an issue one of his customers had faced. The customer complained that Enterprise Alert was not alerting on current issues, and that even when he created a test ticket in his OBM system, no alert went out. After a little back and forth, we concluded that Enterprise Alert was still processing historic data from an event storm in OBM earlier that day.
In light of the recent news about yet another reported Zero-Day Exploit and the accompanying discussions about security, let’s touch on the topic of security audits and how Enterprise Alert can be configured to avoid or at least minimize potential security impact. First, let’s establish what we mean by security audit.
Oracle is gearing up to execute the largest deal in its entire history – the company has agreed to buy Cerner, a leading electronic health records vendor, for $28.3 billion. The Cerner acquisition is slated to be an all-cash deal of $95/share and is expected to complete early next year. Cerner is a healthcare technology firm that streamlines health information and facilitates its accessibility for modern clinical teams.
The right reporting, analytics, and information delivery capabilities can transform an organization. Access to timely data, reporting, and analytics helps ensure that the right data reaches the right users at the right time. Being able to pull any information your business needs, at any given time, gives you the flexibility to use it when and where it is needed.
With Christmas only a few days away, we’d like to do a round-up of something extra festive that we’ve been sharing on social media: The 12 days of Tip-Mas with xMatters! Each day offers the “gift” of a top tip, a resource, or fun fact about xMatters. So go ahead — sing along to get into the holiday spirit!
We've found a pattern to mock external client libraries while keeping code simple, reducing the number of injection spots and ensuring all the code down a callstack uses the same mock client. Establishing patterns like these is what makes test suites great, and improves developer productivity when writing tests. Here's how it works.
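One way to get a single injection spot that every function down the call stack shares is to keep the client in context-local state. The sketch below is a minimal illustration of that idea, not necessarily the pattern the post goes on to describe. `ChatClient`, `FakeChatClient`, `use_client`, and the two application functions are hypothetical names invented for this example.

```python
from contextlib import contextmanager
from contextvars import ContextVar


class ChatClient:
    """Hypothetical stand-in for a real external client library."""

    def post_message(self, channel, text):
        raise RuntimeError("would call the real external API")


class FakeChatClient:
    """Test double that records calls instead of touching the network."""

    def __init__(self):
        self.sent = []

    def post_message(self, channel, text):
        self.sent.append((channel, text))
        return "ok"


# Context-local slot holding the client override, if any.
_client_override = ContextVar("client_override")


def get_client():
    """Return the overridden client if one is set, else a real client."""
    try:
        return _client_override.get()
    except LookupError:
        return ChatClient()


@contextmanager
def use_client(client):
    """The single injection spot: swap the client for the enclosed block."""
    token = _client_override.set(client)
    try:
        yield client
    finally:
        _client_override.reset(token)


# Application code never takes a client parameter; it asks for one.
def notify_channel(text):
    return get_client().post_message("#incidents", text)


def declare_incident(name):
    return notify_channel(f"Incident declared: {name}")
```

In a test, wrapping the call in `with use_client(FakeChatClient()) as fake:` means `declare_incident` and everything it calls see the same fake. The override is reset when the block exits.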
Many organizations are experiencing the need to modernize their IT systems to keep pace in an increasingly digital world. Adopting DevOps helps companies implement and initialize the modernization processes. At xMatters, our path to IT modernization has included implementing DevOps, but we have done it a little differently to ensure we are using agile processes.
We're very pleased to announce that incident.io is now SOC 2 compliant, having successfully completed our Type I audit. Put simply, this means an external auditor has looked at how the company is operating, and how our software is managed and operated, and confirmed that we meet a set of high security standards.
SREs face special challenges during the holidays. Here’s how to manage them.
Like many SaaS businesses, we have an on-call rota to enable us to provide 24x7 cover if there are problems with incident.io. We have a 'pager' which will alert the relevant person if something unexpected happens in our app, so that they can investigate and fix it if needed. Note: This was adapted from an internal document we wrote about how we think about on-call at incident.io.
5 suggestions for executives embracing availability during a time of digital transformation.
As all companies become software-driven, DevOps is becoming an important practice in enterprises and startups across the world. DevOps is about bringing velocity to delivering tech products and services, so you can delight customers and meet business goals. To achieve this velocity, development (dev) and operations (ops) teams work closely together across the software lifecycle, from planning to release. This has led to a new role in engineering teams: the DevOps Engineer.
It takes a lot to run a modern business. From websites to technical solutions and everything in between, it’s no surprise we need better monitoring systems to make sure everything is operational. With multiple gears turning at once on any given platform, incidents are inevitable—especially for companies that are constantly growing and innovating. And the impact of incidents can affect user services, operations, and even business reputation.
Building a complex new product can be scary. What if no-one gets value from it? What if it doesn't work? What if it's hard to change? One way to mitigate these risks is to break down the product into smaller shippable increments, allowing you to capture feedback early and confirming the most important assumptions before fully committing to a solution.
Earlier this fall, we announced a significant evolution in the IT process automation portfolio at PagerDuty—the general availability of PagerDuty Rundeck Actions and early access for Rundeck Cloud. These new offerings reflect our vision to enable companies to take real-time actions by democratizing access to automation. In other words, to quickly and safely delegate automated IT processes to the IT users (and APIs) that need them to get work done.
As a Solution Architect here at xMatters, an Everbridge Company, and through my 30-year career in the IT industry, I've seen many frameworks offering bold new ideas: CMMI, ITIL, PRINCE2, Agile, Scrum, and most recently, DevOps. These frameworks come and go, offering huge improvements in the way we deliver and manage our IT capabilities, but never lasting long enough to act on those promises. That's not to say they haven't made a marked difference in the IT space, or that they haven't been hugely impactful for organizations around the globe. They become launching-off points for a new framework, and now a new term has appeared: DevSecOps.
ROI might be one of the most popular business acronyms in recent memory, and business to business, the definition remains the same: return on investment. No matter the industry, leaders are concerned with ROI and ensuring that every dollar spent is used in the best interest of the organization. But in practice, what does ROI really mean? Let’s discuss!
We’re excited to announce that private incidents are now available on FireHydrant. For the first time, an incident's visibility can be limited to permissioned users only. This is a great solution for security and compliance teams who need to collaborate with their engineering counterparts to resolve incidents. The incidents these teams work on differ dramatically in nature from operational incidents.
In the IT world, application service providers (ASPs) build customer trust by ensuring the continuous, uninterrupted availability of their services and software. Service availability allows customers to operate normally and generate revenue without being directly impacted by their providers’ system failures. Though providers work to ensure system uptime, they are often challenged by unexpected technical issues that impact customer-facing systems.
What a year 2021 has been for us all. We are extremely proud of the continuous innovation and delivery of new features and functionality we have provided throughout the year, all while maintaining enterprise scale and uptime that could win awards. We’ve heard success story after success story from our brilliant customers, each unique in their own way. We couldn’t have had the successful year we’ve had without you, and it’s been our honor to be part of your success.
An overview of how SREs can benefit from Infrastructure-as-Code.
ServiceNow is widely used across Fortune 1000 and Global 5000 enterprises, so it’s no wonder that the majority of BigPanda customers use ServiceNow and integrate with it to streamline their ticketing requests. BigPanda’s AIOps Event Correlation and Automation Platform provides context-rich incidents to IT Ops teams relying on ServiceNow and helps them gain end-to-end real-time visibility into their operations.
In the world of a site reliability engineer (SRE), failure is not only an option, but also expected. Systems, web applications, servers, devices, etc., are all prone to performance issues and unexpected outages at some point. It is an unavoidable fact. These unexpected failures can lead to huge revenue losses, loss of customer trust, and, depending on the industry, possibly fines. Fortunately, SRE incident management is one of the core practices used to limit the disruption caused by unexpected issues.
The following is an analysis of the Amazon Web Services incident on 12/07/2021. Millions of users were affected by an Amazon Web Services outage that took down major online services such as Amazon, Amazon Prime, Amazon Alexa, Venmo, Disney+, Instacart, Roku, Kindle, and multiple online gaming sites. The outage, which originated in the US-EAST-1 region on Dec. 7, 2021, is still ongoing at the time of blog publication.
The next great space race is on. Today, there are multiple companies competing to earn their slice of a global space industry set to be worth more than $1 trillion by 2040. However, launching a satellite into space still isn’t an option for most organizations due to the prohibitive costs and complex engineering required.
The managed security services market is booming. Coming in at $22.8 billion in 2021, it is projected to nearly double in just five years and grow to $43.7 billion by 2026. Moreover, cloud-based managed security services are poised to be the major growth driver for the broader MSP market, coming in at $219.59 billion in 2021, and expected to reach $557.10 billion by 2028. As we can see, providing robust security services is a key competitive differentiator for the lucrative MSP market.
In the world of always-on services, many organizations have taken the path to modernize their IT operations to provide greater agility, lower cost, and most importantly, to deliver frictionless digital customer experiences. Is your DevOps team deploying more frequently than operations can support? Are you struggling to keep up with the maintenance issues associated with aging software? Modernizing your IT operations can be the key to overcoming these complexities.
We’re excited to announce a new set of updates and enhancements to the PagerDuty platform. The product team has been hard at work making updates from Event Intelligence, Runbook Automation, and Applications with Monitoring Tools, to PagerDuty and PagerDuty Community Events.
The holiday season is here, and global retailers are prepared for the biggest retail event of the year. The decrease in new COVID-19 cases, coupled with a rise in vaccination rates, provides a glimmer of hope for shoppers looking to spend for friends and family. Holiday spending is expected to break previous records this year, growing up to 10.5 percent over 2020.
There are roughly 5 stages of an incident:
1. Assess impact
2. Inform customers (statuspage)
3. Identify the issue
4. Mitigate the issue
5. Resolve the incident
Then there’s followup and further work. It's also important to note that (2) should be ongoing as you progress. Updating the status page should be done within reasonable periods – e.g. every 15-20 mins unless you specify otherwise.
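The stages and the status page cadence can be sketched in a few lines of Python. This is only an illustration: the `Stage` enum, the helper names, and the choice of 20 minutes (the upper end of the 15-20 minute range) as the default interval are assumptions made for this example.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Stage(IntEnum):
    """The five incident stages, in order."""
    ASSESS_IMPACT = 1
    INFORM_CUSTOMERS = 2
    IDENTIFY_ISSUE = 3
    MITIGATE_ISSUE = 4
    RESOLVE_INCIDENT = 5


# Assumed default: the upper end of the 15-20 minute range.
UPDATE_INTERVAL = timedelta(minutes=20)


def next_stage(stage):
    """Return the stage that follows, or None once the incident is resolved."""
    if stage is Stage.RESOLVE_INCIDENT:
        return None
    return Stage(stage + 1)


def status_update_due(last_update, now, interval=UPDATE_INTERVAL):
    """True when the status page has gone a full interval without an update."""
    return now - last_update >= interval
```

Because informing customers is ongoing, `status_update_due` is meant to be checked throughout the incident, not only during stage 2.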
Although every company can benefit from SREs, some need SREs more than others.
This blog post defines SRE by explaining SLOs and error budgets, highlighting the innovation vs. reliability tradeoff.
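As a rough illustration of the arithmetic behind an error budget, the sketch below assumes a time-based availability SLO measured in minutes of downtime over a rolling window. The post itself may define its SLOs differently, and the function names here are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after the downtime already spent; negative means overspent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

With a 99.9% target over 30 days, the budget is about 43.2 minutes. While budget remains, a team can spend it on riskier releases. Once it is gone, reliability work takes priority.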
Our December update brings a ‘Who is on duty’ board displaying current team members on duty with contact information. In addition, we have simplified the manual sending of Signls and improved the integration with Azure Sentinel. As always, you can find all the details in this article.
After many weeks of work, we're delighted to announce the latest feature of the incident.io platform: Workflows. Configure your processes once, and we'll make sure you follow them, every time ✨ A little while ago, I was asked the question: “what makes a good incident response?”. Whilst there’s infinite nuance in the answer, mine was pretty straightforward. The best incidents are founded on principles of communication, coordination, and clear roles and responsibilities.
When we asked how technology leaders are feeling about increased pressure on digital services, they reported that, unsurprisingly, their investments in digital have grown. In fact, 72% are ramping up digital transformation efforts. Yet while the C-suite is interested in AIOps and automation to help their teams, it’s not always clear what their approach should be and how this technology can be applied to solve problems for their teams today.