Synced for Success: OnPage & Slack for Incident Response

As the post-pandemic world finds its footing again, a resilient spirit drives the revival, propelling businesses to embrace a new era of technological innovation. Notably, IT teams are swiftly digitizing their processes, particularly in incident response. From virtual collaboration tools and remote IT support to automated incident management, teams have found innovative ways to ensure seamless business continuity while delivering IT services with minimal downtime.

Managing Prometheus cardinality in Grafana Cloud: Adaptive Metrics FAQ

One of the most talked-about questions in observability today is how to get more value out of the ever-increasing amount of data collected by agents, collectors, scrapers, and the like. Back in May, we announced Adaptive Metrics, a new feature in Grafana Cloud that allows you to reduce the cardinality of Prometheus metrics and, with it, the overall volume and cost of your metrics.

Scaling Up to Keep Costs Down: Automation for Web Application Incident Management

Any organization that’s keeping up with today’s sharp rise in business demands (or better yet, getting ahead of the game) is doing so by getting innovative and jumping at the chance to do things differently. They’re not relying on the old ways or trying to use their existing toolbox. Instead, organizations are looking to the newest technologies and means of adding efficiency to as many day-to-day functions as possible.

Reliability Best Practices: How Gremlin Uses Gremlin

Ensuring software availability is essential for any SaaS company—including Gremlin. To do that, our teams need to identify the reliability risks hiding in our systems. That’s why our development, platform, and SRE teams use Gremlin regularly to perform Chaos Engineering experiments, run reliability tests, and track the reliability of our systems against our standards. Along the way they’ve picked up a thing or two about how to find and fix reliability risks with Gremlin.

How to monitor connector's API Connections in Logic Apps?

Let us consider a scenario where a Logic App communicates with SharePoint through API connections, known as connectors. When the connector is configured, it communicates with Azure AD, retrieving a username and password and continuously refreshing the authentication token. When the Logic App calls the connector, the connector performs operations such as uploading files to SharePoint.

4 Node.js Logging libraries which make sophisticated logging simpler

Node.js logging, like any form of software instrumentation, isn’t an easy thing to get right. It takes time, effort, and a willingness to continue to iterate until a proper balance is struck. There are many points to consider. Previously, here on the Loggly blog, I began exploring these questions in the context of three of the most popular web development languages: PHP, Python, and Ruby. But these aren’t the only popular languages in use today.

7 Tips for Remote Data Center Management

As data centers become increasingly decentralized, managing them remotely is now a must-have skill. Data center professionals need to maintain uptime, increase efficiency, and boost productivity across all their global sites without leaving their desk. While this might have once seemed near-impossible, with the right tools and processes, remotely managing your data center can be even better than physically being there.

Rethinking Observability with MinIO and CloudFabrix

While the growth trajectory for data in general is extraordinary, it is the growth of log files that really stands out. As the heartbeat of the digital enterprise, these files contain a remarkable amount of intelligence across a stunning range, from security to customer behavior to operational performance. The growth of log files, however, presents particular challenges for the enterprise. They are not “readable” per se; they require machine intelligence.

Load Testing vs. Performance Testing vs. Stress Testing

Conducting just one type of testing is generally not enough. For example, let’s say you decide to perform unit testing only. Unit tests, however, verify only business logic. Other types of tests, such as integration tests, verify the interaction between components. But what if you want to measure the maximum performance of your application? Or what if you want to know how the application behaves under extreme stress?