Applying Observability - In Conversation with Nitzan Shapira of Epsagon

By Gerald Curley

Jun 23, 2020

11 minutes

OpsMatters

Today we hear from Nitzan Shapira, Epsagon CEO, about the latest exciting Epsagon product announcements, how Applied Observability is needed to make sense of ever more complex applications and environments, and how the role of Ops and Monitoring is evolving as part of the new "Cloud 2.0".

Nitzan founded Epsagon in 2017 with Ran Ribenzaft, Epsagon CTO. Nitzan has over 15 years in R&D and management and previously served with the Israeli Elite Intelligence Unit. He holds bachelor and master of science degrees from Technion - Israel Institute of Technology.

His experience spans programming, machine learning, cyber-security, and reverse engineering. Personally, he enjoys playing the piano, loves traveling, and is an experienced chess player.

How has the recent pandemic and quarantine affected you personally? It must have been quite difficult trying to get a major new release out in the middle of what's been going on in the world.

You know what needs to be done in the company, so nothing really stopped during COVID. Definitely not Product Development or Marketing, and also not Sales or the Business. We moved to working from home for the last three months. We were already used to working from Zoom, both internally and with our customers. We have a team in Israel, and in New York and San Francisco, so we had to do Zoom meetings anyway. So, yes, it wasn't trivial but it wasn't a very difficult transition compared to other companies. We just kept doing what we wanted to do.

What I've seen and heard on a lot of calls that we've done with with prospects and customers is that, APM, and specifically what we're doing for them, is seen as business critical in most cases. So we're definitely not one of the vendors that's on the chopping block, we try to position ourselves not as a nice to have but as an essential. If you're talking about application uptime and obviously cost efficiency and development efficiency that's where I think Epsagon really excels.

What were the drivers behind expanding from Serverless monitoring to Containers? Are you seeing a demand for both?

Yeah, we have seen demand for both from day one really, because people are using both in Production. It's not like one or the other, there is usually some kind of hybrid environment. We decided to focus on Serverless and Lambda when we launched the product because we saw a burning demand there. It really helped us position ourselves as a leader in that market and then attract many of our initial customers to Epsagon because there was no alternative.

And then quickly we started to offer a lot of capabilities around containers and Kubernetes. And now with the announcement of expanding to Azure, we'll broaden the offering to capture what modern applications really look like today. We're not shifting the focus from Applications, Monitoring and Distributed Tracing. We're just addressing the needs out there, those kinds of hybrid environments. So the transition has been very natural really. We got to the point where we needed to expand to another Cloud provider, which is, of course, a big move, because we're seeing demand for both Cloud providers, AWS and Azure, sometimes together. So it becomes a matter of "how can we cover a very large percentage of the needs of the customer out-of-the-box for containers and microservices in the Cloud?"

We're looking to cover the true environment landscape. It's a mix between containers, serverless and other services. If you just cover one type of service, that's not necessarily going to be helpful in all cases. You want to make sure that when something goes wrong the customer knows that Epsagon is the tool to rely on. But we're still not trying to do everything - we are focused on Cloud, and specifically AWS and Azure today. Technologies like Kubernetes, ECS, EKS, Fargate, Lambda, and all those technologies that are considered modern. The legacy world has very good solutions and we don't want to try and do what has been done before. We're doing something that we believe is revolutionary, but still covering what customers need and not just 60% of it.

It's also great news about your partnership with Microsoft and support for AKS. What attracted you to Microsoft/Azure as a partner for the next step in your Cloud evolution?

We've heard great things about the partnership program with Azure from different companies, both startups and larger companies. We've seen really great results working with AWS before, so we thought it made sense to announce it in parallel with an actual business partnership because, at the end of the day, Azure is probably one of the top two Cloud providers today. For some things it's first, some say it's the second, but definitely many enterprises are using Azure. We know that Azure wants to get adoption in the Cloud and also on those new technologies. Not just infrastructure, but using something like AKS, as well as the other services they offer.

We believe we can be a great partner for them and fill in the gaps for their customers. So when they tell customers "Hey, AKS is a great use case for you", the customer may may say "okay", but how do I resolve problems when they happen? That's when we'll be the partner that can help. And we want to be that leader for Azure, just like we are already doing for AWS. We feel that these Cloud providers are great partners because they are the best with infrastructure and with services. And we are able to close the gaps when it comes to monitoring the Business and with troubleshooting.

How long have the new products been in development? Were there any challenges or surprises you found during the development process when adding support for Azure?

Yeah, for sure. We've been working with Azure with these technologies for a long time. It's not something we just started doing, we've probably been doing it for over a year, since we did the first proofs of concept for Distributed Tracing in their environments. It takes a long time, but we have some of the best engineers and researchers in the world, doing reverse engineering and research. On a daily basis they did everything that was required to update our open source libraries and the backend to do the Distributed Tracing we needed and generate all the metrics from the environment. So it did require a significant development, obviously. We've been working with our design partners to make sure that everything is working properly. But it's not something we were surprised by - we are really used to doing these kind of things.

What differentiates Epsagon's container monitoring from existing solutions, such as Sysdig or Prometheus?

Those solutions are actually focused on the container or the Kubernetes environment itself, whereas we are much more focused on the application. That means, yes, you can get Prometheus metrics for your Kubernetes, but from our experience, these metrics are very shallow when it comes to really understanding what's wrong. So you might get some general information about the problem that's happening, or maybe you've only got an alert. But when you have to troubleshoot a distributed architecture, it's just not possible with only those metrics. So you have to go into logs, and you have to do investigation, and then try to correlate the logs from different services. And that can take hours or even days.

So Epsagon added another layer, which we believe is the most important layer on top of the application, with troubleshooting tools and distributed tracing. We integrate with the Kubernetes and Azure environments to generate a bunch of metrics that we believe are valuable to the user, and actually we don't even charge for it. That's what we offer for free.

We believe metrics are pretty commoditized today. When the user includes our SDK library in their code or injects it, which takes only five minutes or less, and they get that layer of tracing in their architecture so that they can really understanding business transactions. The other solutions that you mentioned are only focused on the infrastructure layer. We believe, when you go to the Cloud, the infrastructure becomes the provider's responsibility. They are best placed to make sure it's working properly. And if you think about something like AKS, you're not expected to monitor it. Even if you feel the need to do it today, in two years time you're not going to be doing it at all. You've only got to focus on your application on top of it. The other solutions you mentioned are mostly focused on the infrastructure layer, which is getting abstracted away and moving to the Cloud provider.

We've seen that Distributed Tracing is becoming kind of a buzzword. Companies think of it as something very complicated that only Netflix or Slack can actually do, needing many engineers over weeks and months to implement. We're saying that's not the case, you can do it on your own. But that's not what you're supposed to be doing with your time, you're supposed to be building business features. It was a big technical challenge to automate all this, and we have it patented. We have a lot of sophisticated instrumentation in the code to make sure performance is not impacted, and to match all the events together in the jungle of services in the Cloud. We see ourselves doing a lot of the heavy lifting for customers that are not interested in an in-house solution, which we believe is probably 95% of customers. If you look at the broader market, it's either a tech company that wants to grow quickly, or a more traditional enterprise that wants their engineers to build customer-related features rather than Observability solutions.

Kubernetes Containers Dashboard

What do you think will be the biggest differences end-users will see when using the new products? What are you hoping will be their experience?

Many of our users comment, especially when they start using Epsagon. "I only have to click a few buttons and my architecture appears!". Of course, we don't want to be seen as doing something that nobody understands, the experience should be very good, even delightful. Users are expecting something complicated; they are expecting graphs and dashboards like other infrastructure monitoring solutions, but then suddenly they see those beautiful architecture maps. And they didn't even know they could achieve that in five minutes! They didn't believe until they tried it and then they found it was really, really easy. Then suddenly they have a new source of truth for what's going on in their Production environment. That is what we are trying to achieve.

What does the new Epsagon platform mean for end users trying to gain Obervability for their deep systems and microservices? What will they see that they couldn't see before?

I think Observability has become a word that everybody uses today because things have become so complicated that you can't really monitor everything. Observability has become an easy way to say "okay, we're gonna give you all the data in the world and put it in front of you". But the reality is that having all that data is not enough - first you need to understand it, and then it can take hours or days to work out what the actual cause was. We believe that's just not good enough, maybe you've already lost hundreds of thousands of dollars and your engineers have lost productivity.

We talk about Applied Observability as a way to emphasize that Observability needs to be converted into action. We don't just give you the data, we make it very easy to figure out what went wrong. Instead of just showing you a bunch of graphs and logs with different tools to query them and expecting you to be the best investigator in the world, we believe that you shouldn't be investigating at all - let Epsagon do the work for you. So when you have a problem, you have access to one hundred percent of the transations that happened in the last week. On top of that, you have all the data and the payloads that the services sent to one another in the distributed traces. So it's like a story, when something happens, you find a specific user that reported that problem or whatever and it's all right there, so there is no investigation. You need to know the right question, you need to know what you want to find, but if you know it, you're gonna find it right away. It really takes five minutes.

We are adding another layer, Applied Observability, that provides you with the story that connects the data together, instead of having metrics, logs, traces and jumping between them. It's a different experience.

Tell me more about the new infrastructure monitoring capability. With so many other solutions available what was the reasoning behind this?

The reason is that people asked for it. We believe that after some time our customers rely less on infrastructure monitoring and logs. But we can easily have it. We are essentially hosting Prometheus inside of Epsagon for free, which is very convenient for users, and we don't charge for it. So when they identify a problem and don't have an idea what happened we suggest to put the tracing library inside the application, or inject it, so the experience is encouraging users to have Distributed Tracing in place.

Most problems are identified using the tracing library, because we usually identify problems from the application layer, but it's still good to have that layer as well. Again, I think it's a transition, in two or three years a lot of this is going to be provided by the Cloud provider, and you're not gonna need it any more. I think it's pretty standard today.

What's next for Epsagon? Any plans to further extend your monitoring offerings to other Cloud providers or explore new areas such as Serverless management?

Expanding to more types of environment for sure, more programming languages as well. We already have five languages - Node, Python, Java. .Net, and Go. We are announcing the Azure capabilities first for Node, so we can do tracing, but we're going to quickly extend to Java, .Net, and then Python and Go as well. In AWS, we are covering the five languages as well as seeing some demand for Ruby, PHP and stuff like that.

We aren't going to focus exclusively on Serverless because we don't think that's what the market needs, we need to have a more coherent solution. And generally speaking, Security is something that is more interesting to other people in the organization. Today we are really focused on providing the best experience for Engineering and DevOps. So I don't think we're going to go into Security anytime soon. Rather, we will continue to improve and extend our offerings for Engineering and DevOps to make it the best product for Microservices on Cloud.

With such phenomenal growth over the last few years it's easy to forget that Epsagon is only 3 years old. What's your secret to getting such compelling products to market so quickly?

Yeah, things move quickly. We raised pretty significant funding rounds, which obviously helped us to hire the right people and then build the product quickly. And yeah, it's been a great journey, but it's really just the beginning. We are still a pretty small company, but we want to get much bigger than that. We just opened our HQ in New York and I moved to the US recently, so I guess for me it's really just the beginning. If you look back, you can show the great growth graphs and you can tell good stories to investors, but at the end of the day it's just the beginning and we want to do much, much more.

Any advice for the budding entrepreneurs out there? Perhaps share a little about what drives you personally every day...

The most important advice is to have really good people on the team, both the co-founders and the people you bring to the team early on, and the management team. Those are the people that make the company eventually. That's my observation at this point in time. If you are thinking about starting a company, I really like it, but I don't think it's for everyone. So for the people thinking about it, I would just say "go and do it" because today is a great time. There is a lot of technology in the market, you can meet great investors to raise raise money from. And you can be up and running relatively quickly. It's not easy, but it's definitely much more possible than it was before. I think it's a great journey, I really enjoy it and I would not want to do anything else. Definitely not at this point of my life.

Lastly, the role of Ops teams has undergone massive change in the last 10 years. In your own words, why do you think Ops matters today more than ever before?

I think the tools that are available today are growing rapidly. If you think about the Cloud, 10 years ago it was only a buzzword, but now they are some of the most successful companies in the world. So this is really the future. Anyone who thinks the Cloud is not the future, they're gonna change their mind in a year or two. It's important to understand that businesses need to abstract as many layers as they can in order to be the best at what they do. And the Cloud really enables youto do that, assuming you are a software business.

All those technologies are available to you, and the company, and the developers. And we've seen that companies are utilizing them very quickly. The Cloud providers will help you, of course, because they want you to use them. But later on you may find you've launched an application into production that was more complicated that you can handle. That's when having the right tools and the right people becomes really important, to make order from the mess, and to make sure that the Business can keep working while Engineering is deploying 100 times a day, or dozens of new Microservices each week. These are things that we see all the time.

If you don't have the right foundations in place, it becomes really difficult to manage and things are gonna break. And then you'll have to restart, which is not not very efficient. So I think that the role of Ops today is not just to monitor the servers, because you can let the Cloud provider do that, but rather to bridge the gap between those capabilities that are out there for you to use, and the people using them. And, of course, to make sure that nothing breaks as you grow.

Thanks to Nitzan for taking time out of his busy schedule to share his thoughts with the OpsMatters community. You can learn more about Applied Observability for Azure and Kubernetes by reading the new blog here.