The latest News and Information on Service Reliability Engineering and related technologies.

Using Observability with Kubernetes to Automate Site Reliability Engineering

Sep 8, 2022 By StackState In StackState

In this video, solution architect Anthony Evans explains how the StackState topology-powered observability platform can help SREs automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms, and, with AIOps capabilities, even prevent problems proactively. It's all here!

What is a Security Operation Center and how do SOC teams work?

Sep 6, 2022 By Vishal Padghan In Squadcast

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard them from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify, and prevent cyber threats.

What are the four Golden Signals?

Sep 2, 2022 By Andre Newman In Gremlin

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering Handbook, first published in 2016, details their practices to maintain reliability as Google scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

How to add a Golden Signal to a service in Gremlin RM

Sep 2, 2022 By Gremlin In Gremlin

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.

How to add a Service to Gremlin Reliability Management (RM)

Sep 2, 2022 By Gremlin In Gremlin

This short demo video shows you how to add a Kubernetes service to Gremlin Reliability Management (RM). We'll walk you through selecting the parts of your infrastructure that make up your service, identifying processes for dependency detection, and adding your Golden Signals.

Introduction to Gremlin Reliability Management (RM)

Sep 2, 2022 By Gremlin In Gremlin

Gremlin Reliability Management helps teams standardize and automate reliability, one service at a time. In this video, we walk through the platform by showing you how to add your services to Gremlin, integrate your Golden Signals, run reliability tests, and generate reliability scores.

Round Robin Escalation: An Efficient Way to Distribute On-Call Responsibilities

Aug 30, 2022 By Vishal Padghan In Squadcast

Organizations now handle a high volume of incidents every day. Responders can be overwhelmed by that volume and may end up de-prioritizing important incidents, so it is important to have an efficient on-call scheduling and escalation process in place. In this blog, we will explore how Round Robin Escalations can help distribute on-call load and set up efficient on-call schedules.
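
The idea of distributing on-call load in rotation can be illustrated with a minimal sketch. This is not Squadcast's implementation; the function name `assign_round_robin`, the responder names, and the incident IDs are all hypothetical, and the sketch assumes every responder is available for every incident.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(incidents, responders):
    """Pair each incident with the next responder in a fixed rotation."""
    if not responders:
        raise ValueError("at least one responder is required")
    rotation = cycle(responders)  # endlessly repeats the responder list in order
    return [(incident, next(rotation)) for incident in incidents]


# Hypothetical data, for illustration only.
responders = ["alice", "bob", "carol"]
incidents = ["INC-1", "INC-2", "INC-3", "INC-4", "INC-5"]

assignments = assign_round_robin(incidents, responders)
load = Counter(responder for _, responder in assignments)

for incident, responder in assignments:
    print(f"{incident} -> {responder}")
print(dict(load))
```

With five incidents and three responders, no responder ends up with more than one incident above any other, which is the load-balancing property the rotation is meant to provide.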

The SRE's Quick Guide to Kubectl Logs

Aug 28, 2022 By Eyal Katz In Lightrun

Logs are key to monitoring the performance of your applications. Kubernetes offers a command-line tool called Kubectl for interacting with the control plane of a Kubernetes cluster. This tool provides debugging, monitoring, and, most importantly, logging capabilities. There are many great tools for SREs, and Kubernetes itself supports Site Reliability Engineering principles through its capacity to standardize the definition, architecture, and orchestration of containerized applications.
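
As a rough illustration of the logging capability mentioned above, the sketch below assembles a `kubectl logs` command line in Python. The helper `kubectl_logs_command`, the pod name, and the namespace are hypothetical; the flags used (`--namespace`, `--container`, `--tail`, `--since`, `--follow`, `--previous`) are standard `kubectl logs` options. The sketch only builds the argument list and does not contact a cluster.

```python
import shlex


def kubectl_logs_command(pod, namespace=None, container=None,
                         tail=None, since=None, follow=False, previous=False):
    """Build the argument list for a `kubectl logs` invocation."""
    cmd = ["kubectl", "logs", pod]
    if namespace:
        cmd += ["--namespace", namespace]
    if container:
        cmd += ["--container", container]   # needed for multi-container pods
    if tail is not None:
        cmd.append(f"--tail={tail}")        # only the last N lines
    if since:
        cmd.append(f"--since={since}")      # e.g. "1h" or "30m"
    if follow:
        cmd.append("--follow")              # stream new log lines
    if previous:
        cmd.append("--previous")            # logs from the previous container instance
    return cmd


# Hypothetical pod and namespace names.
command = kubectl_logs_command("web-7d4b9c", namespace="prod", tail=100, since="1h")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` on a machine where `kubectl` is installed and configured.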

SRE vs. DevOps: Differences and Similarities

Aug 26, 2022 By Emiliano Pardo Saguier In InvGate

Organizations scramble to adopt new frameworks and methodologies to make software more scalable. Plus, they need to do it in a reliable way that doesn’t cause more problems. Enter Site Reliability Engineering (SRE), a set of practices introduced by a Google engineer. But how does it stack up against frameworks like DevOps? DevOps and SRE both enhance the software development and product release cycle.

Healthchecks + Squadcast Integration: Routing Alerts Made Easy

Aug 26, 2022 By Vishal Padghan In Squadcast

Healthchecks is a cron job monitoring service that listens for HTTP requests and email messages ("pings") from your cron jobs and scheduled tasks ("checks"). You update your job to send an HTTP request to the ping URL every time the job runs. When your job does not ping Healthchecks.io on time, you receive an alert! If you use Healthchecks for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Healthchecks to the right users in Squadcast.
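
The ping mechanism described above can be sketched as a small wrapper around a scheduled job. This is a minimal illustration, not an official client: `run_and_ping` and `http_get` are hypothetical names, and `PING_URL` is a placeholder that would be replaced by the ping URL shown for your own check. The demo at the bottom swaps in a recording function so that the example runs without network access.

```python
import urllib.request

# Placeholder only: substitute the ping URL of your own check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def http_get(url, timeout=10):
    """Send a plain HTTP GET and return the response status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_and_ping(job, ping_url, send=http_get):
    """Run the job, then ping the monitoring URL only if the job succeeded.

    If the job raises, no ping is sent, so the check goes overdue and the
    monitoring service alerts. Returns True when the ping was delivered.
    """
    job()
    try:
        send(ping_url)
    except OSError:
        # A failed ping (network error) should not crash the job itself.
        return False
    return True


# Demo with a recording sender instead of a real HTTP request.
sent = []
delivered = run_and_ping(lambda: print("backup finished"), PING_URL, send=sent.append)
print(sent)
```

In a real cron job, the `send` argument would be left at its default so that the wrapper issues an actual HTTP request after each successful run.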
