Operations | Monitoring | ITSM | DevOps | Cloud

Chaos Engineering

The Evolution of Observability and Monitoring panel discussion Failover Conf 2021

Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We'll sit down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increase the speed of detection and response, and what organizations need to do to stay on top of growing complexity.

The Evolution of Teams & Culture panel discussion Failover Conf 2021

The most successful organizations are the ones that embrace change and use it to become stronger and more resilient. In this panel discussion, we'll talk with engineering leaders about how they adapted to the challenges of 2020, what successes (and failures) they've seen, and where the future of reliable engineering teams is headed.

Announcing Services Discovery for tracking and improving service reliability

Gremlin helps teams proactively improve the reliability of their systems by running chaos experiments on infrastructure including hosts, containers, and Kubernetes clusters. But as microservice-based architectures and automated cloud platforms become the norm, engineers are shifting their focus from managing infrastructure to managing services. In order to keep these services as resilient as possible, they need tools that can help them find failure modes, reduce incidents, and improve availability.

Chaos Engineering in 60 seconds - Attack a service

Learn how to run a chaos experiment on a distributed service using Services Discovery in Gremlin. Gremlin is the enterprise Chaos Engineering platform on a mission to help build a more reliable internet. Their solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Announcing Services Discovery for tracking and improving service reliability

Gremlin announces Services Discovery for tracking and improving the reliability of distributed services. Gremlin is the enterprise Chaos Engineering platform on a mission to help build a more reliable internet. Their solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Announcing role based access control for API keys for more control over automation

Today, Gremlin is excited to announce the ability to create an API key that can perform actions with the same set of permissions as your user account. This allows you to automate Gremlin tasks safely and securely.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly larger scales. For streaming giant Netflix, the migration to a complex cloud based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like chaos monkey, Netflix employs a cutting edge testing toolkit.

Announcing our latest attacks to deal with meeting fatigue

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin. With everyone working remotely, video conference tools like Zoom have been a critical part of maintaining business continuity. It’s truly amazing that we can continue to work and connect with one another, even during a time where getting together in an office hasn’t been possible…

It's all Chaos! And it Makes for Resilience at Scale

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.