April 2021

Fireside Chat with Jeff Smith and Matt Stratton Failover Conf 2021

Apr 29, 2021 By Gremlin In Gremlin

Matt Stratton, host of the Arrested DevOps podcast, will host Jeff Smith, Director of Production Operations at Centro and author of the book "Operations Anti-patterns, DevOps Solutions", for a conversation about building reliable teams using DevOps principles.

Fireside Chat with Jesse Robbins and Kolton Andrus Failover Conf 2021

Apr 29, 2021 By Gremlin In Gremlin

Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon.com's "Master of Disaster", using intentional failure to help the company become more reliable. Kolton Andrus (CEO at Gremlin) sits down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.

Fireside Chat with Ines Sombra and Ana Medina Failover Conf 2021

Apr 29, 2021 By Gremlin In Gremlin

Reliability is a requirement for the modern internet. Ana Medina joins Inés Sombra, Sr. Director of Engineering at Fastly, to discuss their approach to resilience, how the past year has influenced the way they work, and what practices your engineering organization can adopt to become more reliable.

Implementing DevSecOps in the DoD by Nicolas Chaillan Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

Delivering software quickly and securely is important for every organization, but it's even more important at the US Department of Defense (DoD), where reliability directly impacts national security. Nicolas Chaillan (Chief Software Officer, US Air Force) will discuss the DoD Enterprise DevSecOps Initiative, which he leads along with the DoD's Chief Information Officer and which brings automated software tools, services, and standards to DoD programs. He'll also discuss Platform One, the Air Force's DoD-wide DevSecOps Enterprise Level Service that provides managed IT services capabilities, on-boarding, support, and baked-in zero trust security. This insight from operating at the most rigorous level will help you level up your own organization.

Pragmatic Incident Response: Lessons learned from failures by Robert Ross Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

Incident response is overwhelming. So where do you start? There's a lot of advice out there, but much of it is theory that doesn't take reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder Robert Ross will share quick stories from his experience as an SRE and the tips he's learned along the way.

What's Next for DevOps by Emily Freeman Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

For over a decade, the DevOps movement has been using cultural change to power technological transformation and help companies deliver better products faster and more reliably. While many organizations have embraced this change and reaped the benefits, it hasn't come without challenges and many more remain. In this session, Emily Freeman (author of DevOps for Dummies) shares what's next for DevOps and how it will impact your organization.

The Evolution of Observability and Monitoring panel discussion Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We'll sit down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increased the speed of detection and response, and what organizations need to do to stay on top of growing complexity.

The Evolution of Teams & Culture panel discussion Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

The most successful organizations are the ones that embrace change and use it to become stronger and more resilient. In this panel discussion, we'll talk with engineering leaders about how they adapted to the challenges of 2020, what successes (and failures) they've seen, and where the future of reliable engineering teams is headed.

Leaving the Nest: Guidelines, guardrails, and human error by Laura Santamaria Failover Conf 2021

Apr 28, 2021 By Gremlin In Gremlin

When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios from accidents to exhaustion to everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, let’s discuss how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.

Announcing Services Discovery for tracking and improving service reliability

Apr 27, 2021 By Matt Schillerstrom In Gremlin

Gremlin helps teams proactively improve the reliability of their systems by running chaos experiments on infrastructure including hosts, containers, and Kubernetes clusters. But as microservice-based architectures and automated cloud platforms become the norm, engineers are shifting their focus from managing infrastructure to managing services. In order to keep these services as resilient as possible, they need tools that can help them find failure modes, reduce incidents, and improve availability.

Chaos Engineering in 60 seconds - Attack a service

Apr 27, 2021 By Gremlin In Gremlin

Learn how to run a chaos experiment on a distributed service using Services Discovery in Gremlin. Gremlin is the enterprise Chaos Engineering platform on a mission to help build a more reliable internet. Their solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Announcing Services Discovery for tracking and improving service reliability

Apr 27, 2021 By Gremlin In Gremlin

Gremlin announces Services Discovery for tracking and improving the reliability of distributed services. Gremlin is the enterprise Chaos Engineering platform on a mission to help build a more reliable internet. Their solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Announcing role-based access control for API keys for more control over automation

Apr 22, 2021 By Matt Schillerstrom In Gremlin

Today, Gremlin is excited to announce the ability to create an API key that can perform actions with the same set of permissions as your user account. This allows you to automate Gremlin tasks safely and securely.

Creating Chaos to Achieve Reliability

Apr 22, 2021 By JJ Tang In Rootly

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Apr 6, 2021 By Thomas Russell In Coralogix

Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly large scales. For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit.

Announcing our latest attacks to deal with meeting fatigue

Apr 1, 2021 By Gremlin In Gremlin

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin. With everyone working remotely, video conference tools like Zoom have been a critical part of maintaining business continuity. It’s truly amazing that we can continue to work and connect with one another, even during a time when getting together in an office hasn’t been possible…
