August 2022

Software Metrics Every SRE Team Should Measure

Aug 31, 2022 By Myra Nizami In Blameless

Software metrics give important insight into the performance of your product, but which ones matter most to SRE teams? How do you decide which metrics to track?

Round Robin Escalation: An Efficient Way to Distribute On-Call Responsibilities

Aug 30, 2022 By Vishal Padghan In Squadcast

Organizations today handle a high volume of incidents every day. Responders can be overwhelmed by that volume and may end up de-prioritizing important incidents, so an efficient on-call scheduling and escalation process is essential. This post explores how Round Robin Escalations can help distribute on-call load and set up efficient on-call schedules.
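
As a rough illustration of the round robin idea, the sketch below hands each incoming incident to the next responder in a fixed rotation. It is a generic example, not Squadcast's implementation; the function name `assign_round_robin` and the sample incident and responder names are hypothetical.

```python
from itertools import cycle


def assign_round_robin(incidents, responders):
    """Pair each incident with the next responder in a repeating rotation."""
    if not responders:
        raise ValueError("at least one responder is required")
    rotation = cycle(responders)  # endlessly repeats the responder list in order
    return [(incident, next(rotation)) for incident in incidents]


if __name__ == "__main__":
    # Hypothetical sample data: three incidents, two responders.
    for incident, responder in assign_round_robin(
        ["db-latency", "api-5xx", "disk-full"], ["alice", "bob"]
    ):
        print(f"{incident} -> {responder}")
```

Because the rotation ignores who is currently busy, a real escalation policy would likely also account for acknowledgement timeouts and availability.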

What is reliability engineering?

Aug 30, 2022 By Aimee Pearcy In Reliably

Reliability engineering focuses on the ability of a system to perform as intended and function without failure in a specified environment for the required duration. Reliability engineering can be applied across the entire lifecycle of software development. It is designed to increase the dependability of a product by detecting potential reliability issues early in the software development cycle and correcting the causes of failures that do occur.

Are Code Freezes Still Needed?

Aug 30, 2022 By Mbaoma Mary In Reliably

A code freeze means no code can be altered or modified during the frozen time, and developers will not make any additional changes. Developers can only modify the code in the event of critical flaws, and only to the extent required to correct those problems. Developers primarily observe a code freeze during the final phase of software development, when the software product has reached the delivery state.

The SRE's Quick Guide to Kubectl Logs

Aug 28, 2022 By Eyal Katz In Lightrun

Logs are key to monitoring the performance of your applications. Kubernetes offers kubectl, a command-line tool for interacting with the control plane of a Kubernetes cluster. The tool provides debugging, monitoring, and, most importantly, logging capabilities. There are many great tools for SREs, and Kubernetes itself supports Site Reliability Engineering principles through its capacity to standardize the definition, architecture, and orchestration of containerized applications.

Healthchecks + Squadcast Integration: Routing Alerts Made Easy

Aug 26, 2022 By Vishal Padghan In Squadcast

Healthchecks is a cron job monitoring service that listens for HTTP requests and email messages ("pings") from your cron jobs and scheduled tasks ("checks"). You update your job to send an HTTP request to the ping URL every time the job runs. When your job does not ping Healthchecks.io on time, you receive an alert. If you use Healthchecks for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Healthchecks to the right users in Squadcast.
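
The ping pattern described above can be sketched as a small wrapper: run the job, then send an HTTP request to the ping URL only if the job finished. This is a minimal sketch, not the official Healthchecks client; the helper name `run_with_ping`, the `opener` parameter (there so the network call can be swapped out), and the placeholder URL are assumptions for illustration.

```python
import urllib.request


def run_with_ping(job, ping_url, opener=urllib.request.urlopen, timeout=10):
    """Run job(); if it returns normally, send one HTTP request to ping_url.

    If job() raises, no ping is sent, so the monitoring service sees a
    missed ping and can raise an alert.
    """
    result = job()
    try:
        opener(ping_url, timeout=timeout)
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        # A failed ping should not crash the job itself.
        print(f"ping failed: {exc}")
    return result


if __name__ == "__main__":
    # Placeholder URL: substitute the ping URL shown for your own check.
    run_with_ping(lambda: print("backup done"), "https://example.invalid/ping")
```

Sending the ping only after the job returns means a crashed or hung job looks the same to the monitor as a job that never started, which is the condition the alert is meant to catch.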

Introduction to Service Catalog | Service Ownership | Service Classification Squadcast

Aug 26, 2022 By Squadcast In Squadcast

To make service management a breeze, we bring to you our improved Service Catalog. The Service Catalog is designed to improve Service Classification and bring more transparency to Service Ownership within your org. This video explains how a consolidated summary of all active services from a single dashboard can help you better track your service health.

View Video

SRE vs. DevOps: Differences and Similarities

Aug 26, 2022 By Emiliano Pardo Saguier In InvGate

Organizations scramble to adopt new frameworks and methodologies to make their software more scalable, and they need to do so reliably, without causing more problems. Enter Site Reliability Engineering (SRE), a set of practices introduced by a Google engineer. But how does it stack up against frameworks like DevOps? DevOps and SRE both enhance the software development and product release cycle.

What are Runbooks? And why are they needed?

Aug 25, 2022 By Vardhan NS In Squadcast

Imagine being an Ops engineer in a team just struck by tragedy. Alarms start ringing, and incident response is in full force. It may sound like the situation is under control. WRONG! There's panic everywhere. The on-call team is scrambling for the heavenly door to redemption. And the one thing that doesn't stop: stakeholder inquiries. This situation is bad. But it could be worse. Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don't have sufficient guidance, let alone moral support, you're toast.

Using StatusPage at squadcast | SRE Best practices | Squadcast

Aug 25, 2022 By Squadcast In Squadcast

Let your customers know how your services are doing without them having to ask. One of the core principles of SRE is transparency, and status pages help you communicate the status of your services to your customers at all times, rather than you learning the status of your services from support tickets logged by your customers.

View Video

What are Canary Deployments and Why are they Important?

Aug 25, 2022 By Vishal Padghan In Squadcast

Every modification to software comes with the potential for production problems. Application failures often have serious consequences which can result in a loss of revenue and a poor customer experience. Additionally, organizations constantly try to improve their services for a better customer experience. How can you minimize the chance of error and update your application with confidence?

Performing Postmortems & Postmortem Templates at Squadcast | SRE Best practices | Squadcast

Aug 25, 2022 By Squadcast In Squadcast

Postmortems are a way to summarize the resolution of an incident once it is resolved. They are also a way to create a knowledge base of failures and fixes that can be shared across your team to help build a culture of shared learning and learning from failures.

View Video

Site Reliability Engineering, Site Reliability Engineers and SRE Practices: State of Adoption

Aug 24, 2022 By Heidi Gilmore In StackState

Site reliability engineering (SRE) is what you get when you treat operations as if it’s a software problem. The mission of an SRE practice is to protect, provide for and progress the software and systems offered and managed by an organization with an ever-watchful eye on their availability, latency, performance and capacity.

What is an SRE job description?

Aug 24, 2022 By Myra Nizami In Blameless

Whether you’re building an SRE team or looking for a job as an SRE, understanding the SRE job description is important. How would you define an SRE job?

Blameless Announces Ming Gong as New VP of Product Management

Aug 23, 2022 By Blameless In Blameless

Former Product Leader for Atlassian's Bitbucket Cloud to Lead Blameless' Product Vision and Innovation.

Site Reliability Engineering: Definition, Principles & How It Differs From DevOps

Aug 22, 2022 By MoovingON In MoovingON

Site crashes and outages can cost hundreds of thousands in lost revenue and inconvenience users. Site Reliability Engineering helps build highly reliable and scalable systems, particularly important for companies that depend on their software to support their customers performing critical operations. Hiring a Site Reliability Engineer is the best way to ensure a software system stays up and running at all times. Not only will they help manage infrastructure and applications, but they'll also be able to advise on how to scale a business as it grows - keeping downtime and incidents at a minimum!

Uptime + Squadcast Integration: Routing Alerts Made Easy

Aug 18, 2022 By Vishal Padghan In Squadcast

Uptime is a site monitoring solution that checks various endpoints and notifies users via push notifications when downtime is detected. It collects and stores downtime and response-time data, which are then made available to users as reports. If you use Uptime for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Uptime to the right users in Squadcast. The steps in the post will help you set up the Uptime and Squadcast integration.

geeks+gurus: Rise of SRE - Survey Insights

Aug 18, 2022 By Sumo Logic In Sumo Logic

Site Reliability Engineering (SRE) continues to rise in adoption. Teams that leverage SRE “good” practices are benefitting, individuals are excited about their jobs and IT and the business are collaborating more efficiently. Sounds interesting? We hope so, as there are a few key insights which you should know. Join us to learn more about the exciting journey of SRE. We have partnered with DevOps Institute (DOI) to conduct their inaugural 2022 Global SRE Pulse Survey, and we are excited to share the pulse on SRE.

View Video

Chaos Engineering: What Is It & How Does It Work?

Aug 17, 2022 By Noor-ul-Anam Ruqayya In Blameless

Distributed software systems have many points of failure. Can the process of chaos engineering help identify problems and gauge resiliency?

10 Ways You Can Improve Service Reliability

Aug 12, 2022 By Aimee Pearcy In Reliably

Software reliability can be defined as the probability of failure-free operation of a computer system over a specified period, under a set of specific conditions. It is an important factor in determining software quality. Site reliability engineering (SRE) is a software approach to IT operations that helps organizations improve the reliability of their systems.
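
To make the definition concrete, the sketch below computes the probability of failure-free operation over a given period. It assumes a constant failure rate (the exponential model, R(t) = exp(-t / MTBF)), which is a common simplification and not something the post states; the function name `reliability` and the sample figures are hypothetical.

```python
import math


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of running mission_hours without failure.

    Assumes a constant failure rate, so R(t) = exp(-t / MTBF).
    """
    if mission_hours < 0 or mtbf_hours <= 0:
        raise ValueError("mission_hours must be >= 0 and mtbf_hours must be > 0")
    return math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical figures: a 100-hour period for a system with a 1000-hour MTBF.
    print(f"{reliability(100, 1000):.3f}")  # roughly 0.905
```

Under this model, reliability falls as the period grows and rises as the mean time between failures grows, which matches the "specified period" and "specific conditions" qualifiers in the definition.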

Comparing DBA, DBRE, and SRE Roles

Aug 11, 2022 By Aaron Bertrand In SolarWinds

As I navigate further into my career, I’m finding the scope of my role has shifted over the years. I thought I’d take some time to help relay the differences I’ve seen between traditional database administrators (DBAs), database reliability engineers (DBREs), and site reliability engineers (SREs). Before I start, I want to get a disclaimer out of the way: some of the comparisons here reflect only what I’ve seen and may not match what you’ve experienced.

Tales from the Toil: Taking the pulse of SRE

Aug 9, 2022 By Sam Fell In Sumo Logic

Site Reliability Engineering (SRE) is a growing practice essential for enterprises to ensure service delivery, reliability, and access for users. Many companies only choose to invest in SRE when they have a raging operational fire on their hands. As a result, SREs often start out as firefighters, desperately trying to keep the service online for one more day.

How to Become a Site Reliability Engineer: Job Description, Roles & Responsibilities

Aug 8, 2022 By Emiliano Pardo Saguier In InvGate

Site Reliability Engineering (SRE) is still going strong in the world of software development. As a bridge between development and operations, it’s a necessary part of any organization that wants to work like a well-oiled machine. Simply put, SRE tries to fix a widespread problem in organizations: siloing. But not much is known about the job requirements for becoming a site reliability engineer.

SRE: From Theory to Practice | What's difficult about tech debt?

Aug 4, 2022 By Emily Arnott In Blameless

In episode 3 of From Theory to Practice, Blameless’s Matt Davis and Kurt Andersen were joined by Liz Fong-Jones of Honeycomb.io and Jean Clermont of Flatiron to discuss two words dreaded by every engineer: technical debt. So what is technical debt? Even if you haven’t heard the term, I’m sure you’ve experienced it: parts of your system that are left unfixed or not quite up to par, but no one seems to have the time to work on.

7 Key SRE Principles For Better Reliability

Aug 3, 2022 By Mbaoma Mary In Reliably

Since Google coined the term, the role of an SRE has evolved as the industry has shifted toward large-scale distributed microservices. An SRE’s job is to determine how to make systems more reliable and resilient.

Blameless Demo: Streamline ServiceNow Incident Ticketing Workflows

Aug 3, 2022 By Blameless In Blameless

Our Director of Product, Nicolas Phillip, shows you how to create ServiceNow incident tickets from your preferred chat tool or the Blameless interface. Watch his step-by-step tutorial and begin leveraging Blameless to create incident tickets in ServiceNow today.

View Video

Anti-patterns in Incident Response that you should unlearn

Aug 2, 2022 By Vishal Padghan In Squadcast

It is important to invest time and effort in understanding why a system performs the way it does and how we can improve it. Companies continue with practices that yield successful results, but ignoring anti-patterns can be far worse than choosing rigid processes. In this blog we will explore anti-patterns in incident response and why you should unlearn them.

Analytics in Squadcast | Incident Management | On-call | SRE | Squadcast

Aug 1, 2022 By Squadcast In Squadcast

Analyzing incident data plays a key role in doing SRE better. Squadcast's Analytics Dashboard helps you analyze the performance of your organization or team over a given time period. It also gives you more insight into past outages that affected your systems.

View Video
