Alerting

How to Control Alert Fatigue?

Alerts are indispensable to any IT operations system today. Site reliability engineers (SREs) or ITOps executives set up several monitoring tools across their IT landscape. When a change, high-risk action, or outage occurs in any of the monitored systems, the monitoring tool triggers an automated alert. The alert may appear on the monitoring tool’s dashboard itself, or arrive via email or enterprise collaboration tools like Slack or Teams.

CheckMK and Enterprise Alert - a scripted heartbeat check

A few days ago I received an inquiry about a scripting problem from one of our longtime partners: our DCP Marc Handel from IT unlimited AG. In the exchange with Marc, I realized that his idea of combining the Enterprise Alert Scripting Host, the Windows Task Scheduler, and CheckMK to implement round-trip monitoring could be interesting for the whole community, especially for all our CheckMK customers.

How Status Pages Can Help You Retain Customers in This Digital Age

The events of 2020 and the COVID-19 pandemic upended customer behavior with sweeping, immediate changes whose effects we are still feeling. As a result, people have been forced to live and work differently, which has had a direct impact on consumption patterns and shopping habits. Consumers have, for example, been displaced from their traditional in-person experiences.

How Uptime.com Can Help Troubleshoot a Server Outage

Everyone has heard about the 3 AM wakeup call, but what about those troublesome issues that dig at your team and eat away at your SLA hours? Hard-to-diagnose issues can strike at any time. They drain your team, hurt morale, impede the customer experience… it’s just a whole mess. These are the incidents that test what “response” really means to your organization, because fixing them is not always simple. Something has gone wrong.

Essential Tools for Site Reliability Engineers

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover five leading tools that SREs can use to drive the reliability and stability of computing systems. We’ll also examine how SREs can use these tools to improve operations tasks and infrastructure processes.

Monthly Moo Update | September 2021

This has been quite the summer to remember as we continue to witness our customers achieve remarkable efficiencies through automation, such as deep integrations with change pipelines to suppress alerts during maintenance windows, and alert correlation that creates incidents with dynamic, evolving descriptions that dramatically improve incident management processes.

Robotic Data Automation (RDA): Reducing Costs and Improving Efficiencies of Your Log Management Investment

Human involvement in log management has remained unavoidable despite advancements in ITOps. At a high level, log management collects and indexes all your application and system log files so that you can search through them quickly. It also lets you define rules based on log patterns so that you get alerts when an anomaly occurs. Log management analytics solutions leveraging RDA have been able to detect anomalies and aid predictive models over a machine learning layer.