
What is Mean Time Between Failures - and why does it matter for service availability?

Mean Time Between Failures (MTBF) measures the average duration between repairable failures of a system or product. MTBF helps us anticipate how likely a system, application or service is to fail within a specific period, or how often a particular type of failure may occur. In short, MTBF is a vital incident metric that indicates product or service availability (i.e. uptime) and reliability.
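As a rough illustration of the arithmetic, here is a minimal Python sketch. MTBF is computed as total operating time divided by the number of failures. The availability formula, MTBF / (MTBF + MTTR), is the commonly used steady-state approximation; MTTR (mean time to repair) is not discussed above and is brought in here as an assumption, as are the function names and the sample figures.

```python
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Average operating time between failures, in hours."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_operating_hours / failure_count


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability as a fraction between 0 and 1.

    Assumes the common approximation MTBF / (MTBF + MTTR), where MTTR
    is the mean time to repair (an assumption, not from the text above).
    """
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("mtbf and mttr must be non-negative and not both zero")
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # Hypothetical figures: a 720-hour window with 3 failures and
    # 6 hours of total downtime, so 714 hours of actual operation.
    window_hours, downtime_hours, failures = 720.0, 6.0, 3
    mtbf = mtbf_hours(window_hours - downtime_hours, failures)
    mttr = downtime_hours / failures
    print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h")
    print(f"Availability: {availability(mtbf, mttr):.4%}")
```

Note that operating time excludes downtime, which is why the repair hours are subtracted from the window before dividing by the failure count.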

Upgrade to DX UIM 20.4 CU9 to Leverage New Features and Security Updates

DX Unified Infrastructure Management (DX UIM) is a powerful solution that enables comprehensive infrastructure observability across your digital ecosystems, including private, public, and hybrid clouds. With DX UIM, you can proactively and efficiently manage the performance and availability of your IT infrastructure and applications. DX UIM 20.4 is the current main branch of the solution. This release offers a number of significant capabilities that weren’t available in earlier versions.

How to detect and prevent memory leaks in Kubernetes applications

In our last blog, we talked about the importance of setting memory requests when deploying applications to Kubernetes. We explained how memory requests let you specify how much memory (RAM for short) Kubernetes should reserve for a pod before deploying it. However, this only helps your pod get deployed. What happens when your pod is running and gradually consumes more RAM over time?
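One simple way to spot a pod that gradually consumes more RAM is to fit a straight line through periodic memory samples and look at the slope. The Python sketch below is only an illustration under stated assumptions: the function names, the sample format (seconds, bytes), and the threshold are hypothetical, and gathering the samples from your metrics system is left out. It is not necessarily the method the article itself describes.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, memory usage in bytes)


def memory_growth_rate(samples: List[Sample]) -> float:
    """Least-squares slope of memory usage over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def looks_like_leak(samples: List[Sample], threshold_bytes_per_second: float) -> bool:
    """Flag sustained growth above a hypothetical, workload-specific threshold."""
    return memory_growth_rate(samples) > threshold_bytes_per_second


if __name__ == "__main__":
    # Hypothetical samples taken once a minute from a single pod.
    growing = [(0.0, 100e6), (60.0, 103e6), (120.0, 106e6), (180.0, 109e6)]
    print(f"{memory_growth_rate(growing):.0f} bytes/s")
    print(looks_like_leak(growing, threshold_bytes_per_second=10_000))
```

A steady positive slope over a long window is only a hint: caches that warm up or rising traffic can produce the same pattern, so treat the result as a prompt to investigate rather than proof of a leak.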

Build Your Own Network with Linux and Wireguard

Last Christmas, I bought my wife “Explain the cloud like I am 10” after she told me many times that it was hard for her to relate to what I do in my daily work at Qovery. So far, I have been the only one to read and enjoy the book, and while reading it I wondered whether there were any resources that explain how to build all that. Most topics are software-oriented. So, in this article, I am going to explain how to build your own cloud network 🎊

Why Do Monitoring Service Thresholds Overlap?

Although the title of this blog poses the question “Why do Monitoring Service Thresholds Overlap?”, really the question should be: “In Remote Monitoring and Management Solutions, Why Do Some Monitoring Service Thresholds Overlap?”. That’s a bit of a mouthful, but it’s what I’m going to look at in this blog. Here’s why overlapping thresholds in remote monitoring matter.

Warning: 3 Reasons Why You Shouldn't Pay a "Setup Fee" When Buying a Website Uptime Monitoring Solution

As you may have already discovered (or will soon encounter), many vendors that offer uptime monitoring solutions charge a setup fee. But instead of seeing this as a legitimate cost, you should view it as a stop sign. There are three reasons why.

Full Stack Observability Guide - Examples and Technologies

As modern software systems become increasingly distributed, interconnected, and complex, ensuring production reliability and performance is becoming harder and more stressful. Seemingly nondescript changes to our infrastructure or application can have massive impacts on system uptime, health, and performance, all while the cost of production incidents continues to grow.

Internal Developer Platform vs Internal Developer Portal: Solving for a Central System of Record, and Action

What support do developers need at large enterprises to be productive? We often fall into the trap of evaluating coders on output, maybe even innate talent. We think that the best way to build secure and efficient software is to hire 10X developers, and get out of their way. But even if the individuals have massive intellectual firepower, operational work grows like entropy in the system.