Site Reliability Engineering, Observability, and the Tradeoffs of Modern Software
This blog post defines SRE by explaining SLOs and error budgets, highlighting the innovation vs. reliability tradeoff.
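To make the SLO and error budget relationship concrete, here is a minimal sketch of the arithmetic. It assumes a request-based SLO (a target fraction of successful requests over some window). The function names and the sample numbers are hypothetical and are not taken from the post.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Unspent error budget; a negative value means the SLO has been breached."""
    return error_budget(slo_target, total_requests) - failed_requests


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO, 400 failures so far.
    remaining = budget_remaining(0.999, 1_000_000, 400)
    print(f"Remaining error budget: {remaining:.0f} requests")
```

A team that still has budget left can spend it on riskier releases; a team that has exhausted it shifts toward reliability work. That is the innovation vs. reliability tradeoff in numeric form.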
One more “ops” term in the vein of DevOps is ChatOps, or conversation-based development and operations. ChatOps has been growing in popularity as communication platforms such as Slack become ingrained in our day-to-day engineering lives. A team lead once told me, “If it didn’t happen in Slack, it didn’t happen,” which shows the emphasis on communication platforms as a system of record.
As an update to .conf’s announcement of our continuous code profiling preview, we’re excited to share that Splunk APM’s AlwaysOn Profiling is generally available today for Java applications and is included in APM at no additional cost. Here’s a quick walkthrough of the feature and how you can get started now.
With Kubernetes emerging as a strong choice for container orchestration for many organizations, monitoring in Kubernetes environments is essential to application performance. Kubernetes allows developers to build applications from distributed microservices, which introduces challenges not present in traditional monolithic environments. Understanding your microservices environment requires understanding how requests traverse the different layers of the stack and move across multiple services.
If you are trying to compare the best solutions for application performance monitoring and management, you may have found it complicated to weigh all of the available observability tools while also staying within a reasonable budget.
Innovation is at the core of everything we do at ServiceNow. We’re always striving to make work better for our customers, our partners, and our employees. That’s why we’re humbled and honored to be named for the fifth straight year to the 2021 Fortune Future 50 list, which recognizes innovative companies with high long-term growth potential. ServiceNow is committed to becoming the defining enterprise software company of the 21st century.
Telegraf ships with more than 200 input plugins that collect metrics and events from a comprehensive list of sources. While these plugins cover a large number of use cases, Telegraf provides another mechanism that lets users meet nearly any use case: the Exec and Execd input plugins. These plugins collect metrics and events from custom commands and sources that the user defines.
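As an illustration of the kind of custom command the Exec plugin can run, here is a minimal sketch of a script that prints one metric in InfluxDB line protocol to standard output. It assumes the plugin is configured to parse that format (for example, with `data_format = "influx"`). The measurement name, tag, field values, and helper names are all hypothetical, and the escaping covers only the common cases.

```python
import time


def escape_key(value: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render a field value: booleans, integers (with an 'i' suffix), floats, or quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical metric: the depth of some application queue on this host.
    print(to_line("queue_depth", {"host": "web-1"}, {"depth": 42}, time.time_ns()))
```

The Exec plugin runs a command like this on each collection interval, while Execd keeps a long-running process alive and reads metrics as it emits them; the same output format works for either.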
A common DevOps use case involves alerting when hosts stop reporting metrics, aka a deadman alert. This can be done using the monitor.deadman() Flux function. One can easily create a deadman (or threshold) check in the InfluxDB UI Alerts section or craft a custom task to alert as well. Check out InfluxDB’s Checks and Notifications system post for more details. It’s also possible to use the monitor.deadman() function directly in a dashboard cell.
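The idea behind a deadman check can be sketched in a few lines of Python. This is a conceptual illustration only, not how `monitor.deadman()` is implemented in Flux: it assumes you already have the timestamp of the last metric seen from each host, and the function and host names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def dead_hosts(last_seen: dict[str, datetime], now: datetime, window: timedelta) -> list[str]:
    """Return the hosts whose most recent report is older than the allowed window."""
    cutoff = now - window
    return sorted(host for host, ts in last_seen.items() if ts < cutoff)


if __name__ == "__main__":
    now = datetime(2021, 12, 1, 12, 0, tzinfo=timezone.utc)
    last_seen = {
        "web-1": now - timedelta(seconds=30),  # still reporting
        "web-2": now - timedelta(minutes=10),  # silent for too long
    }
    for host in dead_hosts(last_seen, now, window=timedelta(minutes=5)):
        print(f"ALERT: {host} has stopped reporting metrics")
```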
Our December update brings a ‘Who is on duty’ board displaying current team members on duty with contact information. In addition, we have simplified the manual sending of Signls and improved the integration with Azure Sentinel. As always, you can find all the details in this article.
I can’t remember the last time I drove down Highway 101 between San Francisco and the South Bay and didn’t see a billboard for a product claiming to be the single tool that solves all of my data problems.