May 2022

Getting AWS CloudTrail alerts via SNS Endpoint

May 31, 2022 By Vishal Padghan In Squadcast

Logging and auditing have been an essential part of troubleshooting application and infrastructure performance. You can instantly spot areas of risk to ensure quick correction and prevention of issues. In this blog, we will explore the AWS CloudTrail service and discuss how integrating it with Squadcast can help you route alerts to the right users for quick and efficient incident response. Let's get started.

DevOps Team Structure | Roles & Responsibilities

May 31, 2022 By Noor-ul-Anam Ruqayya In Blameless

We explain how a DevOps team is structured, the roles and responsibilities within the team, and the balance between an individual contributor and the needs of the team.

Simplifying SLO and Error Budget tracking for SRE teams

May 28, 2022 By Vishal Padghan In Squadcast

Service level objectives (SLOs) and their associated service level indicators (SLIs) are the foundation of a strong SRE culture, promoting accountability, trust, and timely innovation. We are on a mission to simplify SLO and error budget tracking, and with that aim in mind we have added the SLO Tracker feature to the Squadcast platform. SLO Tracker seeks to provide a simple and effective way to keep track of your error budget burn rate without the hassle of configuring and aggregating multiple data sources.
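
As a rough illustration of the "error budget burn rate" mentioned in this teaser, here is a minimal sketch using one common definition: the observed error rate divided by the error rate the SLO allows. The function name and the sample numbers are hypothetical, and this is not a description of how SLO Tracker itself computes the figure.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO window; values above 1.0 mean it runs out early.
    `slo` is a fraction, e.g. 0.999 for a 99.9% target (assumed input format).
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget = 1.0 - slo  # allowed error rate
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    observed_error_rate = bad_events / total_events
    return observed_error_rate / budget


if __name__ == "__main__":
    # Hypothetical sample: 20 failed requests out of 10,000 against a 99.9% SLO.
    print(f"burn rate: {burn_rate(20, 10_000, 0.999):.2f}")
```

A burn rate that stays above 1.0 is the usual signal that the budget will be exhausted before the window ends.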

5 Tips If You're the 1st SRE Hire by Instacart's First SRE

May 27, 2022 By Quentin Rousseau In Rootly

Site Reliability Engineers (SREs) have a considerable set of tasks to juggle no matter where they work or how long their company has had an SRE practice. But if you’re the very first SRE to join an organization – as many SREs are these days, given that the SRE trend is trickling down into smaller and smaller companies – you face a special group of challenges. You may find it difficult to get buy-in for SRE from other technical teams.

SRE basics: Understanding SLAs, SLOs and SLIs

May 27, 2022 By Aimee Pearcy In Reliably

SLAs, SLOs and SLIs are fundamental to site reliability engineering (SRE), but what are they and why are they important for delivering services?

10 Reasons You Need A Service Level Agreement & Why It's important

May 26, 2022 By Mbaoma Mary In Reliably

A Service Level Agreement (SLA) consists of many service commitments. It is an essential part of a contract to outsource software development or software support between two or more parties, specifying the duties and the quality and type of service a company would provide for a fee to a customer.

What Is DevOps Automation & What Are The Benefits?

May 26, 2022 By Myra Nizami In Blameless

Looking into DevOps automation? We explain how automation can improve your process, how to prioritize which tasks to automate, best practices, and how to avoid common mistakes.

Error Budgets: Ultimate SRE Guide For Teams

May 26, 2022 By Samadrita Ghosh In Reliably

No engineered system can guarantee 100% uptime. Unforeseen failures are bound to cause downtime or a poor customer experience. It is therefore best practice to allow a margin for plausible failures. An error budget is that margin: an amount of unreliability, such as a set number of hours of downtime, that customers are told about in advance so they can tolerate it when failures occur.

5 Key Requirements of Modern Enterprise Monitoring & Observability Platforms

May 25, 2022 By Heather Miller In Circonus

Monitoring is an essential function of enterprise SRE teams and a critical component of business service deliverability. Its importance has only grown as enterprise environments and technologies continue to evolve at a rapid pace. Unfortunately, traditional monitoring is no longer enough.

SRE: From Theory to Practice | What's difficult about incident command

May 24, 2022 By Emily Arnott In Blameless

A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role?

Shift Left Reliability meetup - May Fifteen minutes or bust

May 20, 2022 By Reliably In Reliably

There is a yawning gap opening up between the best and the rest — the elite top few percent of engineering teams are making incredible gains year on year in velocity, reliability and human compatibility, whilst the bottom 50% are actually losing ground. The loss has nothing to do with engineering ability. Take an engineer out of an elite-performing team and place them in the bottom 50%, and they become subpar too; take an engineer out of a mediocre team and embed them in an elite team, and they are pulling their weight within the year.

Severity vs. Priority | Understanding the Differences

May 19, 2022 By Myra Nizami In Blameless

Wondering about severity vs. priority? We explain severity and priority and discuss their differences and their impact on the incident management process.

Is It Really An Incident?

May 18, 2022 By Kurt Andersen In Blameless

At first glance, people tend to think that incidents are cut-and-dried, relatively objective occurrences. But if you look closely, incidents are highly varied, often require unique handling, and often defy clear answers to something as seemingly simple as knowing when they even start.

A Chat with Lex Neva of SRE Weekly

May 17, 2022 By Emily Arnott In Blameless

Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news. I had always figured Lex must be among the most well-read people in SRE, and likely #1.

The Journey Of Building Reliability And Scaling Your Systems

May 14, 2022 By Stoyan Yanev In Reliably

Starting small and scaling your systems to serve billions of requests per month is never an easy path, so how do you build an infrastructure whilst making the right decisions and compromises for your services? Choosing the right technology stack and ensuring your CI/CD pipeline is reliable are two key steps, which we will explore.

What Does It Mean To Build Resilient Service Applications?

May 14, 2022 By Yan Cui In Reliably

Resilience is the capacity to recover quickly from difficulties; in other words, toughness. It is not about preventing failures, but about being able to recover from them quickly. As Amazon’s CTO Werner Vogels famously said, ‘everything fails all the time’. Failures will inevitably happen, but we can build applications that withstand different kinds of failure. For example, in a data center, hardware is going to fail all the time.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

May 13, 2022 By Weihan Li In Rootly

What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own? That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud.

Continuous Deployment vs. Delivery | Differences Explained

May 12, 2022 By Noor-ul-Anam Ruqayya In Blameless

Curious about continuous deployment vs delivery? We explain what each is, what happens in each step, and their importance in the DevOps lifecycle.

How The Experts Build Reliable Cloud Apps

May 11, 2022 By Emily Arnott In Blameless

We live in the cloud era, where your services don’t live in machines in your garage, but are spread across huge data centers around the world. Cloud providers can help meet increasing demands for reliability – for example, they offer dynamic resource allocation that can handle usage spikes. At the same time, going cloud native means not having a physical server onsite that you can fiddle with, introducing its own unique challenges.

Software Reliability Metrics That Matter To Engineers

May 11, 2022 By Ben Johnson In Reliably

Software reliability is the probability of failure-free operation of a computer program for a specified period of time in a specified environment. It is critical for validation in order to determine characteristics in terms of system performance, functional compatibility, maintenance, competency, installation coverage and process documentation continuance. Software reliability helps you to identify and fix bugs, improve performance, and test features.

DevOps Pipeline | Best Practices, Tips, & Techniques

May 10, 2022 By Noor-ul-Anam Ruqayya In Blameless

Looking into DevOps pipelines? We explain what a DevOps pipeline is, how to build one, and the best practices for building one for your team.

How Sumo SREs manage and monitor SLOs as Code with OpenSLO

May 10, 2022 By Drew Horn In Sumo Logic

At Nobl9’s annual SLOconf—the first conference dedicated to helping SREs quantify the reliability of their applications through service level objectives (SLOs)—Sumo Logic shared our contribution of slogen to the OpenSLO community, as well as our commitment to OpenSLO as an emerging standard for expressing SLOs as Code. slogen is an open source, SLO-as-code CLI tool based on the OpenSLO specification.

DevOps Vs SRE: The Main Differences

May 8, 2022 By Aimee Pearcy In Reliably

Site reliability engineering (SRE) is a set of principles that incorporates aspects of software engineering into IT operations. It takes tasks that would typically have been done manually by operations teams and gives them to engineers to solve using software and automation. This helps to create a bridge between development and operations teams. The concept of SRE was created by Google back in 2003. Since then, it has been adopted by thousands of organizations all over the world.

Observability Vs Monitoring: What's The Difference?

May 8, 2022 By Mbaoma Mary In Reliably

Clients expect prompt implementation of changes to their software, and this requirement motivates site reliability engineers to incorporate reliability into applications. The healthy practice of observability and monitoring can improve the reliability and security of software systems. Monitoring is the recording and interpreting of data from software systems to keep track of their performance.

How to: Reliability Insights Overview in Blameless

May 6, 2022 By Blameless In Blameless

In this video, our Solutions Engineer walks you through the Reliability Insights view in Blameless. Discover how to create custom data dashboards. You might start with MTTX metrics, but what other metrics are reliability teams following closely? We'll show you how to get those set up in Blameless.

[Webinar] Unlock self-service infrastructure monitoring with the Sensu Integration Catalog

May 5, 2022 By Sensu In Sensu

Introducing the Sensu Integration Catalog: a marketplace-like UX for simplifying new user onboarding and deploying production-ready monitoring in a matter of minutes. The Sensu Integration Catalog is also an open marketplace that new and existing users can contribute to by sharing Sensu configurations. Backed by an industry-leading monitoring-as-code solution, Sensu provides new users with a point-and-click interface to get started quickly, while facilitating DevOps and SRE automation best practices.

Are your SLOs realistic? How to analyze your risks like an SRE

May 4, 2022 By Ayelet Sachto In Google Operations

Setting up Service Level Objectives (SLOs) is one of the foundational tasks of Site Reliability Engineering (SRE) practices, giving the SRE team a target against which to evaluate whether or not a service is running reliably enough. The inverse of your SLO is your error budget — how much unreliability you are willing to tolerate.
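
To make the relationship in this teaser concrete, here is a minimal sketch that reads "inverse" as the complement (error budget = 100% minus the SLO) and converts it into allowed downtime. The function names and the 30-day window are assumptions chosen for illustration, not something the post specifies.

```python
def error_budget(slo: float) -> float:
    """Fraction of time (or requests) allowed to fail; `slo` is a fraction such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime the budget permits over a window (30 days is an assumed default)."""
    return error_budget(slo) * window_days * 24 * 60


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} SLO -> {minutes:.1f} minutes per 30 days")
```

Comparing that allowance with the downtime your known risks are expected to cause is one simple way to judge whether a target is realistic.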

How to Achieve Measurable Reliability Results

May 4, 2022 By Emily Arnott In Blameless

Reliability is more important than ever. As users depend on services more and more, and competition in every sector grows, a great digital experience becomes the baseline for expectations, not the ceiling. It’s crucial to invest in making your software reliable enough to keep customers happy. But what does investing in reliability look like?

The Reverse Red Herring

May 4, 2022 By Geoff Townsend In Blameless

During an incident, time is fungible. At points it seems to go way too fast, and at times it seems like an eternity for a command to complete. More important, however, is how it feels to be in an incident. It’s a heightened state of being, where any and every piece of information could be “the one” that helps crack open what is really going on. Likewise, there is an inherent distrust of incoming information.

CI/CD Pipeline | What It Is & How It Works

May 3, 2022 By Myra Nizami In Blameless

Wondering about CI/CD pipelines? We explain what the CI/CD pipeline is, the steps involved, and best practices along the way.

NewsKit API: The journey of building reliability into our systems at News UK

May 3, 2022 By Reliably In Reliably

Growing from a small start to serving billions of requests per month is never an easy path. Stoyan Yanev, Principal Engineer, and Krasimir Petrov, Senior Software Engineer at News UK, will show how they built their infrastructure and the decisions and compromises that had to be made along the way. The talk will center on the NewsKit API and the importance of reliability, before opening up a discussion among the group.

How To Reduce Technical Debt

May 2, 2022 By Aimee Pearcy In Reliably

Technical debt is the implied cost of the additional work that is required when a team chooses a quick, easy solution that is limited, instead of a more well-thought-out, higher-quality solution that would take longer. Essentially, it’s what happens when teams prioritize speed over quality. Examples of technical debt include untested code, unreadable code, dead code, duplicated code, or outdated documentation.
