Monthly Archive

1979, a nuclear accident and SRE

Jul 31, 2023 By Aniket Rao In Last9

Deep diving into the 'Normal accident' theory by Charles Perrow, and what it means for SREs.

Ingest OpenTelemetry metrics with Prometheus natively

Jul 29, 2023 By Prathamesh Sonpatki In Last9

Native support for OpenTelemetry metrics in Prometheus.

Kubernetes Monitoring Best Practices

Jul 28, 2023 By Squadcast Community In Squadcast

Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements. Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.

The Medium is the Message: How to Master the Most Essential Incident Communication Channels

Jul 28, 2023 By Ashley Sawatsky In Rootly

We’ve all seen it: a company experiencing a major incident and going radio silent, leaving their customers to wonder “Are they doing something about this?!”. If you’ve ever been on the inside of something like this, you know the answer is most likely yes, there are people working hard to put out the fire as quickly as possible. But when it comes to incidents, perception is reality for customers.

How we tame high cardinality in time series databases

Jul 28, 2023 By Piyush Verma In Last9

Part 1 of a series of posts on the engineering design decisions that make high cardinality work in time-series databases.

Looking Beyond Atlassian StatusPage: The 5 Best Alternatives

Jul 27, 2023 By Sanjog Sandhu In Squadcast

Status Pages are crucial cogs in your Incident Communication process; they serve as vital channels to keep your stakeholders informed during periods of downtime. Although there are many proficient tools in the market, such as Atlassian Status Page and Status.io, these standalone Status Pages can come with a hefty price tag, with various pricing plans and tiers for both Public and Private Status Pages. Moreover, with Atlassian Cloud’s recent issues, its dependability is in question.

OpenTelemetry for dummies: ELI5

Jul 27, 2023 By Mohan Dutt Parashar In Last9

What is OpenTelemetry? Why is it important? Do SREs need to adopt OTel? An Explain It Like I'm 5.

OpenTelemetry vs. Prometheus

Jul 26, 2023 By Last9 In Last9

OpenTelemetry vs. Prometheus - Difference in architecture, and metrics.

Breaking Down the Pillars of Observability from Data to Outcomes

Jul 25, 2023 By Last9 In Last9

The world of cloud-native and distributed microservices has revolutionized software development and deployment. However, the sheer volume of data these systems generate can often lead to confusion and uncertainty. You're not alone if you've ever felt lost in the sea of observability data.

View Video

Webinar: Embracing Declarative Provisioning and Observability in cloud environments

Jul 24, 2023 By Last9 In Last9

Organizations face increasingly complex challenges in deploying and managing their systems in today's rapidly evolving technological landscape. Declarative provisioning and observability have emerged as a powerful approach to address these challenges. This talk delves into declarative provisioning and observability, exploring its benefits, principles, and practical implementation strategies.

View Video

Introduction to ELK Tech Stack

Jul 21, 2023 By Chitra Bisht In Squadcast

ELK Stack, also known as the Elastic Stack, is a powerful and versatile open-source toolset that has revolutionized the way businesses manage and analyze their data. It seamlessly integrates three robust components (Elasticsearch, Logstash, and Kibana) to offer a comprehensive solution for searching, analyzing, and visualizing large volumes of data in real time. So buckle up for a comprehensive overview of the ELK stack and its components, which will be a great starting point for beginners.

Pinpoint performance issues in downstream services with the Dependency Map Navigator

Jul 21, 2023 By Scott Richardson In Datadog

Visibility into the upstream and downstream dependencies of your services is key to maintaining a performant microservices environment. Application developers and SREs rely on this visibility to quickly trace issues back to the source, which is essential during incidents—when time is of the essence—throughout day-to-day operations, and as systems evolve and scale.

Enhanced Incident Response: Maximizing Microsoft Teams with Squadcast

Jul 20, 2023 By Abhishek Sony In Squadcast

Of late, more and more businesses are relying on ChatOps tools like Microsoft Teams for a range of functions beyond simple communication. Incident management is no exception to this growing trend. However, Microsoft Teams alone may not possess all the necessary capabilities to efficiently perform these functions. To bridge this gap, integration with core applications becomes necessary.

Take back control of your Monitoring

Jul 18, 2023 By Last9 In Last9

The challenges in the monitoring world are known widely. We all know about these problems, what they are, and why they are important. While each one of the problems has its own solution, it all boils down to one thing – COST. How do we balance the tradeoffs without worrying about the huge costs of solving these challenges? For high-precision monitoring and observability, you need efficient and high-precision control levers. Take back control of your Monitoring with Levitate - a managed time series data warehouse.

View Video

What is OpenTelemetry Collector

Jul 17, 2023 By Last9 In Last9

What is OpenTelemetry Collector, Architecture, Deployment and Getting started.

How JCB is leveraging SRE to lead a successful digital transformation

Jul 15, 2023 By Shimpei Sasano In Google Operations

How JCB improves team structure, risk management, and application and platform development.

InfluxDB vs. Thanos

Jul 14, 2023 By Prathamesh Sonpatki In Last9

InfluxDB vs Thanos: Overview, Pros and Cons, and Differences.

What Is Site Reliability Engineering? Understanding the complexities of this crucial function

Jul 14, 2023 By incident.io In Incident.io

Site reliability engineers manage a lot, and often in incredibly high-stakes environments. Remember that scene from "The Matrix" where Neo dodges bullets in slow motion? Of course you do. As an SRE, it can feel like you're the person getting hit by those bullets, frantically trying to investigate performance issues, automate away toil, and support the engineers around you, all before the next wave of attacks.

Improve Visibility and Capture More Data with Triage Incidents

Jul 12, 2023 By Ashley Sawatsky In Rootly

As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).

What Site Reliability Engineering needs - A swarm of rogue bees

Jul 11, 2023 By Aniket Rao In Last9

If all companies are software companies, all companies need better Observability to understand how performant their software is.

Prometheus vs. VictoriaMetrics (VM)

Jul 10, 2023 By Last9 In Last9

Comparing Prometheus vs. VictoriaMetrics (VM) - Scalability, Performance, Integrations.

Prometheus vs. Cortex

Jul 7, 2023 By Last9 In Last9

Comparing Prometheus vs. Cortex - Scalability, Cost, Performance, Known Weaknesses.

Docker Compose Logs: Guide & Best Practices

Jul 2, 2023 By Squadcast Community In Squadcast

Docker Compose is a tool for defining and running multi-container Docker applications. It allows developers to streamline the process of configuring, building, and running multiple containers as a single unit with a docker-compose.yml. This configuration file specifies the services, networks, and volumes required for an application, and their relationships and dependencies. The docker-compose logs command displays the logs of all services defined in the docker-compose.yml file.
