SRE

The latest news and information on Site Reliability Engineering and related technologies.

Top 5 Resiliency Trends of 2023

In today’s world, resilience is no longer a nice-to-have or a methodology to experiment with; it has become a necessity for sustained success in software development and IT operations. As DevOps and Agile teams push boundaries, devise new methodologies, and drive innovation, the ability to recover quickly from failures, adapt to changing conditions, and maintain high performance under pressure has become essential.

Implementing Zero Trust: A Practical Guide

According to the Harvard Business Review, 2022 saw more than 83% of businesses experiencing multiple data breaches. Ransomware attacks, in particular, were up 13%. With cybersecurity such a hot topic for business owners, it’s no surprise that implementing a zero trust policy has become so important. In this guide, we’ll cover how to implement zero trust and why it’s important for your business to do so. Let’s get started.

Mastering Incident Resolution: Process and Best Practices

For DevOps and IT teams, incident resolution is an important aspect of predicting, resolving, and documenting service disruptions. It refers to the part of the incident management process where responders restore the service to functioning. Modern technology has come a long way, but it’s not without flaws. When businesses suffer from cyber-attacks, system crashes, and network outages, it impacts the organization on many levels.

Streamlining Incident Management with our latest feature update: Merge Incidents

Hey folks! We’re back with another nifty addition to your Incident Management arsenal. You can now merge incidents with a few clicks! With this latest update, you can reduce noise while dealing with a complex incident by merging incidents across services under a parent incident. This is typically useful when multiple incidents stem from the same underlying issue or root cause.

Journey from Junior to Senior SRE: Key Insights and Strategies

As Site Reliability Engineering (SRE) continues to grow in popularity, many professionals are looking for ways to advance from junior to senior roles. While there is no one-size-fits-all approach, the transition from junior to senior SRE is marked by a gradual increase in experience and a set of key skills. In this blog, we will explore the valuable insights and strategies shared by experienced SREs.

What's the Difference Between an Agile Retrospective and an Incident Retrospective?

Blameless Chief Operating Officer Ken Gavranovic recently sat down with Lee Atchison, a renowned expert in system reliability, to discuss the topic of conducting effective incident retrospectives. You can watch their engaging, informative discussion below, or read on for our overview of the greatest hits from their talk. Agile development and incident management are the backbones of any tech-driven development cycle. At the heart of these practices lies the art of retrospectives.

Elastic AI Assistant for Observability

Harness the power of generative AI to turn insights into actions. Powered by the Elasticsearch Relevance Engine™ (ESRE™), Elastic’s AI Assistant (in technical preview for Observability) transforms problem identification and resolution by replacing manual data chasing across silos with an interactive assistant that delivers accurate, context-aware remediation for SREs.

Seven Models of Cloud Native Applications

In today's cloud-driven landscape, organizations are transitioning from legacy monolithic systems to agile, scalable, and secure cloud-native solutions. Some are even forging new cloud-native applications. However, the concept of cloud-native design remains subjective, lacking a universal blueprint. This blog aims to provide clarity and guidance for designing precise cloud-native applications and container deployment.