July 2023

1979, a nuclear accident and SRE

Jul 31, 2023 By Aniket Rao In Last9

Deep diving into the 'Normal accident' theory by Charles Perrow, and what it means for SREs.

Ingest OpenTelemetry metrics with Prometheus natively

Jul 29, 2023 By Prathamesh Sonpatki In Last9

Native support for OpenTelemetry metrics in Prometheus.

Kubernetes Monitoring Best Practices

Jul 28, 2023 By Squadcast Community In Squadcast

Kubernetes can be installed using different tools, whether open-source, from a third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements. Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.

How to make high cardinality work in time series databases: Part 1

Jul 28, 2023 By Piyush Verma In Last9

Part 1 of a series of posts on the engineering design decisions that make high cardinality work in time-series databases.

The Medium is the Message: How to Master the Most Essential Incident Communication Channels

Jul 28, 2023 By Ashley Sawatsky In Rootly

We’ve all seen it: a company experiencing a major incident and going radio silent, leaving their customers to wonder “Are they doing something about this?!”. If you’ve ever been on the inside of something like this, you know the answer is most likely yes, there are people working hard to put out the fire as quickly as possible. But when it comes to incidents, perception is reality for customers.

Looking Beyond Atlassian StatusPage: The 5 Best Alternatives

Jul 27, 2023 By Sanjog Sandhu In Squadcast

Status Pages are crucial cogs in your Incident Communication process; they serve as vital channels to keep your stakeholders informed during periods of downtime. Although there are many proficient tools in the market, such as Atlassian Status Page and Status.io, these standalone Status Pages can come with a hefty price tag, with various pricing plans and tiers for both Public and Private Status Pages. Moreover, with Atlassian Cloud’s recent issues, its dependability is in question.

OpenTelemetry for dummies: ELI5

Jul 27, 2023 By Mohan Dutt Parashar In Last9

What is OpenTelemetry? Why is it important? Do SREs need to adopt OTel? An Explain It Like I'm 5.

OpenTelemetry vs. Prometheus

Jul 26, 2023 By Last9 In Last9

OpenTelemetry vs. Prometheus - Difference in architecture, and metrics.

Breaking Down the Pillars of Observability from Data to Outcomes

Jul 25, 2023 By Last9 In Last9

The world of cloud-native and distributed microservices has revolutionized software development and deployment. However, the sheer volume of data these systems generate can often lead to confusion and uncertainty. You're not alone if you've ever felt lost in the sea of observability data.

View Video

Webinar: Embracing Declarative Provisioning and Observability in cloud environments

Jul 24, 2023 By Last9 In Last9

Organizations face increasingly complex challenges in deploying and managing their systems in today's rapidly evolving technological landscape. Declarative provisioning and observability have emerged as a powerful approach to address these challenges. This talk delves into declarative provisioning and observability, exploring its benefits, principles, and practical implementation strategies.

View Video

Introduction to ELK Tech Stack

Jul 21, 2023 By Chitra Bisht In Squadcast

ELK Stack, also known as the Elastic Stack, is a powerful and versatile open-source toolset that has revolutionized the way businesses manage and analyze their data. It integrates three components, Elasticsearch, Logstash, and Kibana, to offer a comprehensive solution for searching, analyzing, and visualizing large volumes of data in real time. So buckle up for a comprehensive overview of the ELK stack and its components, which will be a great starting point for beginners.

Pinpoint performance issues in downstream services with the Dependency Map Navigator

Jul 21, 2023 By Scott Richardson In Datadog

Visibility into the upstream and downstream dependencies of your services is key to maintaining a performant microservices environment. Application developers and SREs rely on this visibility to quickly trace issues back to the source, which is essential during incidents—when time is of the essence—throughout day-to-day operations, and as systems evolve and scale.

Blameless Unveils Multibot Support, Empowering Enterprise Security Teams to Manage Incidents on their Terms

Jul 20, 2023 By Blameless In Blameless

Leading Incident Management Solution's New Multibot Feature Allows SecOps Teams to Achieve Greater Flexibility and Convenience.

Enhanced Incident Response: Maximizing Microsoft Teams with Squadcast

Jul 20, 2023 By Abhishek Sony In Squadcast

Of late, more and more businesses are relying on ChatOps tools like Microsoft Teams for a range of functions beyond simple communication. Incident management is no exception to this growing trend. However, Microsoft Teams alone may not possess all the necessary capabilities to efficiently perform these functions. To bridge this gap, integration with core applications becomes necessary.

Mastering Zero Trust - Pillars for Security

Jul 20, 2023 By Emily Arnott In Blameless

Zero Trust is a heightened security measure that blocks people and devices from accessing company data by default, only allowing access to those who prove they require it. Zero Trust assumes restricted access to company resources by all: Anyone or anything accessing company resources requires verification each time the system is accessed. There are no options to “trust this device next time” or “save password for next time”.

Templates for Automating Incident Response

Jul 20, 2023 By Emily Arnott In Blameless

A security incident is the last thing any DevOps lead wants to see. Along with the vast number of protocols required to overcome an incident, there’s a hefty amount of paperwork to complete. Security incidents can even lead to legal repercussions if personal data is leaked. An incident response plan template drastically reduces the time and effort spent dealing with incident reports.

Unveiling Multibot, the "glue" for enterprise workflows

Jul 19, 2023 By Alex Greer In Blameless

How are you delivering Slack incident management workflows that serve the many teams across your enterprise? How are you addressing the differences in their use cases, access needs, isolation needs, and tech stacks, all while enabling everyone to collaborate? These are challenging questions to answer. To do so effectively, you have a host of conditions to support at the team and company-wide levels.

Video: How to Apply the Golden Signals to Your Monitoring Strategy

Jul 19, 2023 By Brian Conn In Circonus

The Four Golden Signals, developed by Google SREs, are key metrics used to monitor the health of your systems. In today’s complex IT environments, these key metrics can help engineers and IT operations prioritize the most significant issues to address. The Four Golden Signals are latency, traffic, errors, and saturation. In the following 9-minute video, I focus on two of these signals in particular, latency and errors, because they often result in customer-facing symptoms.

8 Tips to incorporate the voice of the customer in your story grooming/sprint planning

Jul 18, 2023 By Anjali Udasi In Zenduty

Creating successful products and projects goes beyond just great ideas and flexible processes. It's about truly understanding and listening to your customers. Attentively listening to their wants and needs unlocks invaluable insights that can revolutionize your story planning and project execution. In this blog, we'll look at easy but powerful tips to use the customer's input during story planning.

Take back control of your Monitoring

Jul 18, 2023 By Last9 In Last9

The challenges in the monitoring world are known widely. We all know about these problems, what they are, and why they are important. While each one of the problems has its own solution, it all boils down to one thing – COST. How do we balance the tradeoffs without worrying about the huge costs of solving these challenges? For high-precision monitoring and observability, you need efficient and high-precision control levers. Take back control of your Monitoring with Levitate - a managed time series data warehouse.

View Video

What is OpenTelemetry Collector

Jul 17, 2023 By Last9 In Last9

What is OpenTelemetry Collector, Architecture, Deployment and Getting started.

How JCB is leveraging SRE to lead a successful digital transformation

Jul 15, 2023 By Shimpei Sasano In Google Operations

How JCB improves team structure, risk management, and application and platform development.

InfluxDB vs. Thanos

Jul 14, 2023 By Prathamesh Sonpatki In Last9

InfluxDB vs Thanos: Overview, Pros and Cons, and Differences.

What Is Site Reliability Engineering? Understanding the complexities of this crucial function

Jul 14, 2023 By incident.io In Incident.io

Site reliability engineers manage a lot, and often in incredibly high-stakes environments. Remember that scene from "The Matrix" where Neo dodges bullets in slow motion? Of course you do. As an SRE, it can feel like you're the person getting hit by those bullets, frantically trying to investigate performance issues, automate away toil, and support the engineers around you, all before the next wave of attacks.

Share highly customizable Blameless Retrospectives as ServiceNow Problems

Jul 13, 2023 By Nicolas Philip In Blameless

For many organizations, ServiceNow is a crucial platform to run and scale your organization across all departments. Many organizations’ engineering teams have been relying on ServiceNow Incident and Problem Management. Despite that, many have been experiencing a growing volume of incidents hindering their ability to scale not only their incident response but also their retrospective operations, potentially compromising their data governance and compliance requirements.

Understanding Chaos Engineering and its Benefits

Jul 12, 2023 By Anjali Udasi In Zenduty

In today's fast-paced technological landscape, ensuring the resilience and dependability of systems is crucial. This is where Chaos Engineering comes in, transforming how organizations approach system testing and fortification. Chaos Engineering helps find vulnerabilities that could go undetected under normal circumstances by purposefully introducing controlled interruptions and failures.

26 DevOps Automation Tools that SaaS Loves in 2023

Jul 12, 2023 By Emily Arnott In Blameless

DevOps is a term combining “development” and “operations”. It involves the use of tools and processes to minimize the time and effort spent on software creation and maintenance. Many DevOps technologies use automation to reduce manual tasks. These DevOps automation tools sometimes use AI-based technology to remove human-based operations, or simpler scripting and processing. This increases speed in feedback and performance between development and operations departments.

Improve Visibility and Capture More Data with Triage Incidents

Jul 12, 2023 By Ashley Sawatsky In Rootly

As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).

What Site Reliability Engineering needs - A swarm of rogue bees

Jul 11, 2023 By Aniket Rao In Last9

If all companies are software companies, all companies need better Observability to understand how performant their software is.

Prometheus vs. VictoriaMetrics (VM)

Jul 10, 2023 By Last9 In Last9

Comparing Prometheus vs. VictoriaMetrics (VM) - Scalability, Performance, Integrations.

Prometheus vs. Cortex

Jul 7, 2023 By Last9 In Last9

Comparing Prometheus vs. Cortex - Scalability, Cost, Performance, Known Weaknesses.

The Incident Response Lifecycle: Strategies for Effective Incident Management

Jul 3, 2023 By Anjali Udasi In Zenduty

The nature of security and incident management is cyclical rather than linear. Resolving an issue doesn't mark the end of the team's responsibilities. Instead, it signals the opportunity to enhance reliability, strategize, prepare, and prevent similar problems. This is where incident response comes into the picture. But what is incident response, and what steps are included in the incident response lifecycle? Let's understand them in detail.

Docker Compose Logs: Guide & Best Practices

Jul 2, 2023 By Squadcast Community In Squadcast

Docker Compose is a tool for defining and running multi-container Docker applications. It allows developers to streamline the process of configuring, building, and running multiple containers as a single unit with a docker-compose.yml. This configuration file specifies the services, networks, and volumes required for an application, and their relationships and dependencies. The docker-compose logs command displays the logs of all services defined in the docker-compose.yml file.
