Squadcast

Hrushikesh shares his journey into SRE and his thoughts on the future of this space

Hrushikesh is passionate about solving complex design problems with simple and reliable solutions. He is technology and platform agnostic and doesn’t believe in limiting himself to just a few technologies. He started his career in 2006 with a media company, where he was responsible for introducing new technologies and for driving a team to deliver quickly. He does not limit his role to just development and operations and loves exploring everything in the tech space.

Better Incident Response: Incident Classification & Setting Severities with Tags

What you absolutely must know when responding to an incident is what kind of impact it has on customers and how negatively it can affect your team. This is typically addressed by following some kind of incident classification, usually “incident severity levels”, to indicate the importance of every incident - that is, to understand how seriously various stakeholders are affected and to route the incident differently if necessary.
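As a rough illustration of severity-based classification and routing, here is a minimal Python sketch. The severity names, the percentage thresholds, the responder names, and the `classify` and `route` helpers are all assumptions made for this example; they are not taken from Squadcast or any specific tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a lower number means a more severe incident."""
    SEV1 = 1  # critical: most or all customers affected
    SEV2 = 2  # major: a significant subset of customers affected
    SEV3 = 3  # minor: limited or no customer impact


@dataclass
class Incident:
    title: str
    customers_affected_pct: float  # assumed input: share of customers impacted, 0-100
    tags: dict[str, str]


def classify(incident: Incident) -> Severity:
    """Assign a severity from customer impact, using illustrative thresholds."""
    if incident.customers_affected_pct >= 50:
        return Severity.SEV1
    if incident.customers_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


# Assumed routing table: which (hypothetical) responders are notified per severity.
ROUTES = {
    Severity.SEV1: ["on-call-engineer", "incident-commander", "support-lead"],
    Severity.SEV2: ["on-call-engineer", "support-lead"],
    Severity.SEV3: ["on-call-engineer"],
}


def route(incident: Incident) -> list[str]:
    """Tag the incident with its severity and return who should be notified."""
    severity = classify(incident)
    incident.tags["severity"] = severity.name
    return ROUTES[severity]


incident = Incident("Checkout API errors", customers_affected_pct=60.0, tags={})
print(route(incident), incident.tags)
```

Storing the severity as a tag on the incident keeps the classification visible to everyone who later handles it, and a lookup table keeps the routing rules easy to review and change.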

Scheduling IT and Engineering on-call rotations just got easier

It shouldn’t take you more than a few seconds to understand your on-call schedule and rotations and how to change them. On-call scheduling and alerting tools should make this as simple as possible. If you’re spending more than a few seconds working out what your on-call rotations will look like for the next day, week, or month, then you need to start looking for a better on-call management tool.

Hiteshwar shares his thoughts on being an SRE

Hiteshwar is an SRE based out of Mumbai, India. He specializes in distributed systems. He works on Kubernetes, running his own custom clusters, maintaining them, and creating tools to manage and monitor them. He likes to share what he learns by writing articles and blogs on Medium and LinkedIn. He is an active speaker at meetups and developer groups and also teaches DevOps and SRE practices at learning centers.

What you can show on your status page

When something goes down, the first thing a customer does is check whether something is wrong with their own systems or whether the issue lies with one of their service providers. So it’s important to make sure that your status page has all the information they need, so that they don’t feel the need to raise an issue or create a ticket, which adds to your support costs.

Using a Status Page in your Incident response process

A status page is a communication tool that lets you display the current working status of your various services: fully functional, partially degraded, severely affected, and so on. You can define the nomenclature of the service statuses yourself. On the status page, you can also access and update the uptime and incident history data for all your internal-facing or customer-impacting components.
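To make the idea concrete, here is a minimal Python sketch of how component statuses and uptime might be modeled. The `Status` names, the `overall_status` and `uptime_pct` helpers, and the component names are hypothetical and chosen only for illustration; they do not describe any particular status page product.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical status labels; a real status page lets you define your own."""
    OPERATIONAL = "Fully functional"
    DEGRADED = "Partially degraded"
    MAJOR_OUTAGE = "Severely affected"


# Statuses ordered from least to most severe, used to pick the worst one.
SEVERITY_ORDER = [Status.OPERATIONAL, Status.DEGRADED, Status.MAJOR_OUTAGE]


def overall_status(components: dict[str, Status]) -> Status:
    """Return the worst status across all components (operational if none exist)."""
    return max(
        components.values(),
        key=SEVERITY_ORDER.index,
        default=Status.OPERATIONAL,
    )


def uptime_pct(down_minutes: float, window_minutes: float) -> float:
    """Uptime as a percentage of the reporting window, rounded to 3 decimals."""
    return round(100.0 * (1 - down_minutes / window_minutes), 3)


components = {
    "API": Status.OPERATIONAL,
    "Dashboard": Status.DEGRADED,
    "Webhooks": Status.OPERATIONAL,
}
print(overall_status(components).value)
print(uptime_pct(down_minutes=43.2, window_minutes=30 * 24 * 60))
```

Reporting the worst component status as the overall status is one simple convention; some teams prefer to show per-component status only, so that a degraded internal service does not alarm customers who never use it.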

Making Observability Actionable at Scale - Sisir Koppaka | DBS DevConnect 2019

Many organisations already possess a vast amount of existing data about production systems. As customer expectations evolve, organisations are often challenged to find more proactive ways of dealing with traditionally reactive incident response activity. In this talk, we discuss approaches to unlock value from this data by making it truly actionable.