Scientific Incident Management with Dan Slimmon

Dan Slimmon is an incident management veteran who has worked at Etsy and HashiCorp and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy of "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems.

Datadog On Datadog

At Datadog, over 2,000 engineers deploy and ship new features daily. For a leading observability and security platform used by thousands of companies, ensuring quality and reliability is no small feat. Part of our commitment to excellence lies in our dogfooding culture: our engineering organization is one of the largest and most demanding users of the Datadog platform.

How to keep track of what's running in your Gremlin team

Part of the Gremlin Office Hours series: a monthly deep dive with Gremlin experts. Reliability testing is ongoing, and tracking that work can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on, unless you use Gremlin.

Why Monitoring iManage is Critical for Enhancing End-User Experience in Legal Firms

As a Performance Field Technical Consultant working with customers in the legal industry, my primary focus is to ensure that technology enhances productivity rather than hinders it. Legal professionals rely on iManage as a business-critical application for document management, collaboration, and compliance. However, with the increasing shift to the cloud and integration with platforms like O365, ensuring a seamless user experience has become more complex.

How we responded to a 2+ hour partial outage in Grafana Cloud

On Tuesday, Feb. 18, 2025, we experienced an outage that lasted approximately 150 minutes and impacted roughly 25% of our Grafana Cloud services. To our customers: we are very sorry and more than a little embarrassed that we stepped outside our own processes and advice to cause this. You rely on us to help monitor and troubleshoot your environments, and this type of incident obviously makes it harder for you to do that.

AI Incident Summarization in 50 Seconds

Ivanti's AI Incident Summarization feature lets your IT team catch up on the history of a given incident at the speed of GenAI, so they can help your end users solve their problems that much faster. Ivanti finds, heals, and protects every device, everywhere – automatically. Whether your team is down the hall or spread around the globe, Ivanti makes it easy and secure for them to do what they do best.

Effortless observability for Django applications

Observability is critical for web operations: it ensures that the application is working as expected and helps identify potential issues. However, setting up observability has traditionally been challenging, because it can take hours to set up all the infrastructure, instrument your code, and enable observability in production. But now there is a better way: native support for Django in Charmcraft and Rockcraft, which has observability built in and ready to go!