
The latest news and information on Site Reliability Engineering and related technologies.

Podcast: Break Things on Purpose | Gustavo Franco, Senior Engineering Manager at VMware

In this episode, Jason is joined by Gustavo Franco, Senior Engineering Manager at VMware, to chat about chaos in Gustavo's early days. Gustavo reflects on everything from Google's early disaster recovery practices to the contemporary SRE movement.

How they SRE: Insights from the Cloudflare SRE team

Cloudflare is a cloud services provider with offices around the globe, from San Francisco, US, to London, England, and Sydney, Australia. Their mission, stated front and center on their homepage, is to help build a better Internet. While that may read like hyperbole, their numbers are impressive: Cloudflare has over 126,000 paying customers, and 95% of Internet users in the developed world are within 50ms of their network.

Site Reliability Engineer (SRE) Roles and Responsibilities

Software development is getting faster and more complex, frustrating IT operations teams more than ever. DevOps gained popularity as a way to combat siloed workflows, poor collaboration, and a lack of visibility. While a DevOps culture has helped teams collaborate better and deliver reliable software faster, DevOps teams don't necessarily have someone dedicated to building systems that improve site reliability and performance.

How Changelog monitors and optimizes website performance with Grafana Cloud

Developers around the world get their news from Changelog, an indie media company on a mission to create inspiring content for software developers. Through their popular podcasts, including The Changelog, Go Time, JS Party, and Ship It!, the team at Changelog helps listeners stay up-to-date on the latest happenings, trends, and tools in a constantly evolving industry.

How We Use Sloth to do SLO Monitoring and Alerting with Prometheus

One of the most challenging tasks for Site Reliability Engineers is aligning system reliability with business goals. There is a constant tension between delivering more features, which increases the product's value, and keeping the system reliable and maintainable. A significant ally in achieving both objectives is the Service Level Objective (SLO) framework.

Site Reliability Engineer vs. Software Engineer vs. Cloud Engineer vs. DevOps Engineer: What Are the Differences?

The evolution of software engineering over the last decade has led to the emergence of numerous job roles. So how do Software Engineers, DevOps Engineers, Site Reliability Engineers, and Cloud Engineers differ from one another? In this post, we drill down into these roles and compare their functions.

SRE vs. DevOps: What Are the Differences and How Can They Work Together?

The growing importance of technology to business success has forced practically every company to hire competent, experienced IT professionals. As technology ecosystems grow more complex, organizations need a broader range of professionals to handle tasks like product development, troubleshooting, and customer service. SRE and DevOps have emerged as two of the most critical approaches to success.

Top 13 Site Reliability Engineer (SRE) Tools

The role and responsibilities of a site reliability engineer (SRE) may vary depending on the size of the organization. For the most part, a site reliability engineer juggles multiple tasks and projects at once, so for most SREs, the tools they use reflect their ever-evolving responsibilities. A typical SRE is busy automating tasks, cleaning up code, upgrading servers, and continually monitoring performance dashboards, so their toolbelt tends to hold a wide range of tools.