The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Human-in-the-Loop DevOps  Taylor Barnett  Failover Conf 2020

Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insights. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing the human interaction in specific tasks.
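As a rough illustration of a partially automated runbook, the sketch below runs automated steps on their own and pauses at any step flagged as needing a person. The names `Step`, `run_runbook`, and the `approve` callback are hypothetical and chosen for this example only; they do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure for illustration)."""
    name: str
    action: Callable[[], str]   # the automated part of the step
    needs_human: bool = False   # True if a person must approve first


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order, stopping at the first step a human declines."""
    log: List[str] = []
    for step in steps:
        # Ask the human only for steps that were flagged as needing judgment.
        if step.needs_human and not approve(step.name):
            log.append(f"{step.name}: halted, awaiting human decision")
            break
        log.append(f"{step.name}: {step.action()}")
    return log
```

In practice `approve` could be a chat prompt or a ticket sign-off; here it is just a function that returns `True` or `False`, so the toil is automated while the judgment call stays with a person.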

The Future of DevOps is Resilience Engineering  Amy Tobey  Failover Conf 2020

For more than a decade, many of us have been working to bring DevOps to organizations around the world. We’ve made amazing progress, but there’s so much more to do. Now that continuous integration and deployment are widespread and developers are taking more ownership of production, what’s next? Amy will talk about what Resilience Engineering is, how it relates to DevOps, and how she thinks it gives us the science and research we need to take our organizations to the next level of robustness while remaining agile and growing our ability to care for the people around us.

Virtana Names Carahsoft Federal Distributor for Entire Product Portfolio

SAN JOSE, Calif. and RESTON, Va. – May 5, 2020 – Carahsoft Technology Corp., The Trusted Government IT Solutions Provider®, today announced that Virtana, the leader in hybrid infrastructure management for mission-critical workloads, has named Carahsoft its Federal distributor for its entire portfolio.

Top 3 benefits of Apache Cassandra and how to use it

It’s no secret that organisations have a love-hate relationship with data. When organisations collect too little data, decision making can be unguided and market insights can be lost. On the other hand, with large and active datasets, where requests number in the hundreds of thousands, maintaining database performance is increasingly difficult. One open source application, Apache Cassandra, enables organisations to process large volumes of fast-moving data in a reliable and scalable way.

Performing chaos in a serverless world  Gunnar Grosch  Failover Conf 2020

Chaos engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior. The principles of chaos engineering have been around for years, and we have now reached the point where chaos engineering has gone from being a buzzword and a practice used by a few large organizations in very specific fields to being put into use by companies of all sizes and industries.

Swim Don't Sink: Why Training Matters to a Site Reliability Engineering Practice  Jennifer Petoff

Do you offer training to the engineers in your organization, or do you throw them in at the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place.

Fight, Flight, or Freeze - Releasing Organizational Trauma  Matt Stratton  Failover Conf 2020

When humans are faced with a traumatic experience, our brains kick in with survival mechanisms. These mechanisms include the familiar fight-or-flight response, but also the freeze response, which occurs when we are terrified or feel that there is no chance of escape.

Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation  Heidi Waterhouse

Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples.

How to fail with Serverless  Jeremy Daly  Failover Conf 2020

Everything fails all the time. Knowing how to deal with these failures in serverless applications becomes essential to building resilient, highly available systems. In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we now have multiple (often asynchronous) components processing events from several sources, all with vastly different retry behaviors and failure mechanisms. Utilizing old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are difficult to debug and hard to maintain.
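One common way to keep errors from being swallowed is to retry a bounded number of times and then record the failed event somewhere it can be inspected. The sketch below is a minimal, provider-agnostic illustration of that idea and is not taken from the talk; `process_with_retries`, the `dead_letters` list, and the parameter names are assumptions made for this example. A real serverless platform would typically provide its own retry and dead-letter mechanisms.

```python
import time
from typing import Any, Callable, Dict, List


def process_with_retries(
    event: Any,
    handler: Callable[[Any], Any],
    dead_letters: List[Dict[str, Any]],
    max_attempts: int = 3,
    base_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call handler(event), retrying with exponential backoff.

    If every attempt fails, the event and its last error are appended to
    dead_letters (a stand-in for a dead-letter queue) instead of being lost.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except Exception as exc:  # broad on purpose: nothing is dropped silently
            last_error = exc
            if attempt < max_attempts:
                # Delay doubles after each failed attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

The in-memory list keeps the example self-contained; the point is only that an event which exhausts its retries ends up somewhere visible rather than disappearing.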