Alerting

Incident Ready: How to Chaos Engineer Your Incident Response Process

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO Robert Ross shares how FireHydrant customers leverage best practices to break, mitigate, resolve, and fireproof incident processes, and shows how to use chaos engineering philosophies to stress test three critical parts of a great process.

Microsoft's 3 major incidents in 10 days, where did they go wrong?

In case you haven’t heard, last week Microsoft experienced a major outage that prevented users from accessing Office 365, its cloud-based subscription service with 200 million monthly active users. It was the company’s third outage in ten days, and it triggered a deluge of complaints from customers who saw a 'something went wrong' message when they tried to access their accounts.

October 2020 Update: Mute overwrite for iPhone (Critical Alerts), undo and more

Our October update brings the long-awaited mute-overwrite on iPhone (‘critical alerts’). We also introduce an undo action for Signl acknowledgements and closures. And in the web app, you can now batch-acknowledge and close multiple Signls at once. All new features are introduced below. Enjoy.

What is IT On-call?

An “on-call” worker is available to provide support at their employer’s request. Your enterprise may have on-call employees across various departments, and these workers can help your business if problems arise, even outside normal operating hours. How you manage your on-call teams can have significant ramifications for your enterprise and its stakeholders.

Anomaly detection 101

What is anomaly detection? Anomaly detection (aka outlier analysis) is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior. Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, such as a change in consumer behavior. Machine learning is increasingly used to automate anomaly detection.
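
To make the definition concrete, here is a minimal sketch of one of the simplest approaches, a z-score check that flags values far from the mean of a sample. It is an illustration only, not the method described in the linked article; the function name `find_anomalies`, the threshold of 2.0, and the sample latencies are all assumptions made up for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold.

    Hypothetical helper for illustration: a plain z-score test against the
    sample mean and sample standard deviation.
    """
    if len(values) < 2:
        return []  # stdev needs at least two points
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Made-up response times in milliseconds with one obvious spike.
latencies_ms = [102, 98, 101, 99, 100, 103, 97, 450, 101, 100]
print(find_anomalies(latencies_ms))  # prints [(7, 450)]
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a z-score can never get very large; that is why the sketch uses a threshold of 2.0 rather than the more common 3.0. Median-based measures or learned models are usually more robust on real monitoring data.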

How SIGNL4 provides for a digital handover procedure

Handover procedures in operations and maintenance are a key element of business continuity. Because work in this field is usually organized in shifts, it is essential to keep track of critical incidents, machine breakdowns, job ownership and completion, issues that are still open, and other related items. This knowledge has a significant impact on a timely or even proactive response, for instance if issues resurface.

Streamline communication workflows with the Datadog Slack App

Sharing information about the health and performance of an application is a critical part of any team’s daily workflow. That’s why we’re excited to announce the Datadog Slack App, which simplifies crucial communication tasks by deepening the integration between Datadog and Slack.