
My Experience with Monitoring Using Datadog

By Arun Rajeevan. Sep 17, 2022

I have been working in cloud-based application development for more than 7 years, and I have come across a lot of different monitoring tools like Kibana, Splunk, etc.

Initially I was very reluctant to use a centralized (single) monitoring tool (one for all), as I used to think one solution could not have all the nitty-gritty details that you need for debugging.
Recently I started using Datadog in my day-to-day activities, and I really loved some of its features, which I would want other engineers to try out when debugging a complex application.
In my previous company, I used different tools for monitoring (Sentry for UI, LogRocket for UI performance, CloudWatch for server logs, Apollo Studio for GraphQL errors and performance).
My current company already had Datadog, and I started to explore what Datadog could provide that the earlier tools couldn’t.

Below are the points that I felt are the best parts of using Datadog.

1. End-to-end traceability: You can trace a request originating from your UI component through all the backend services that the request has invoked, along with their logs, response times, outgoing HTTP calls from those backend services, etc. This was a game changer for our application, as I no longer need to switch between different monitoring tools to understand what’s happening during an API invocation.
Earlier, I needed to have a correlation ID and search for it in the backend logs to see the related logs. With Datadog, I could see all the logs that belong to a particular client request. I could also see the dependent services and how much time each service took to serve that request.

2. Context info can be added along the trace: We had the below requirement:
When we trace a request, we wanted to know which user had made it, with details like email ID or username.
We could easily do this with the Datadog SDK, using the span and baggage concepts. Basically, you can add any info as baggage to a particular request at your gateway layer, and then you can get this information in your trace.
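
To make the baggage idea concrete, here is a minimal sketch of how key-value context can travel with a request. It does not use the Datadog SDK. It only models the concept with a W3C-style `baggage` HTTP header, and the function names (`inject_baggage`, `extract_baggage`) and the keys `user.email` and `user.name` are made up for illustration. A real tracer does this propagation for you.

```python
from urllib.parse import quote, unquote


def inject_baggage(headers: dict[str, str], baggage: dict[str, str]) -> dict[str, str]:
    """Return a copy of `headers` with a W3C-style `baggage` header added.

    Keys are assumed to be simple tokens; values are percent-encoded.
    """
    members = [f"{key}={quote(value, safe='')}" for key, value in baggage.items()]
    out = dict(headers)
    if members:
        out["baggage"] = ",".join(members)
    return out


def extract_baggage(headers: dict[str, str]) -> dict[str, str]:
    """Parse the `baggage` header back into a dict (empty if absent)."""
    result: dict[str, str] = {}
    for member in headers.get("baggage", "").split(","):
        key, sep, value = member.strip().partition("=")
        if not sep:
            continue  # skip empty or malformed members
        value = value.split(";", 1)[0]  # ignore optional properties
        result[key.strip()] = unquote(value.strip())
    return result


# Gateway layer: attach user details (hypothetical keys) to the outgoing request.
outgoing = inject_baggage(
    {"accept": "application/json"},
    {"user.email": "jane@example.com", "user.name": "Jane Doe"},
)

# Downstream service: read the same details back from the incoming headers.
user_context = extract_baggage(outgoing)
```

In a downstream service, the extracted values would typically be attached to the active span so that they show up next to the trace.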

3. Session Replay using session ID or email ID: Every UI log in Datadog has a session ID, and if you have enabled Datadog’s Session Replay, you can search for a user session using the session ID or email ID and look at the user’s behavior or the actions performed during the session. If a user complains about a problem in the prod environment, you can ask what time it happened and filter the sessions by email ID, or you can search for the error the user faced, copy the session ID from it, and then go to Session Replay and search for the user session with that session ID.
This was again a game changer, as we could relate logs to a user session.

4. Monitoring of external APIs: This was completely new for me, as I never thought I could monitor the performance of the external APIs used in the application.
In Datadog, if you have used an external API, you can see what the responses were and how much time they took, and also get the count of requests made during a time interval.
We were using one expensive external API and wanted to know which part of our application was calling it. We used the APM feature and searched for the URL, which showed us the HTTP response code and the time taken to complete each request. We could also see which service had invoked it and jump to the request trace.

5. Single UI view to search all the logs with filters: The most annoying part of CloudWatch is filtering logs. You need to go to Logs Insights, understand the query language, select the log groups, and so on. It’s overkill for a developer who may not know a lot at the start and just wants to debug a problem. With Datadog, you simply select the service you want to search, put the search string in double quotes, adjust the time interval, and that’s it. You get all the logs, and you can apply more filters like session ID, browser, and OS (for UI logs). This feature is really awesome, and it really helps you debug faster.

6. Debug a problem using user Session Replay: Once you replay a user session, you can see the UI events like click, scroll, focus, etc. Each event has a time associated with it. If anyone complains about a problem, you can go directly to that particular time, play it, and see the UI events. With each UI event, you can see details like network calls and console logs, along with their response times and HTTP codes.
This helps in spotting UI errors and console errors, and in checking whether any API was too slow.

7. Time zone settings: Our application is heavily used in the US, and our engineering team sits in India. We wanted to debug issues that happened to users in the US, and therefore had to filter by US time or UTC. Datadog has a setting to change the time zone and filter the logs accordingly.
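
For a sense of the arithmetic the time zone setting saves you from, here is a small sketch that converts a time reported by a US user into a UTC search window and then shows the same instant in India time. The function name `search_window_utc`, the 15-minute margin, and the sample date and zone are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def search_window_utc(reported_local: str, user_tz: str, margin_minutes: int = 15):
    """Turn a user-reported local time into a (start, end) window in UTC."""
    local = datetime.strptime(reported_local, "%Y-%m-%d %H:%M").replace(
        tzinfo=ZoneInfo(user_tz)
    )
    center = local.astimezone(ZoneInfo("UTC"))
    margin = timedelta(minutes=margin_minutes)
    return center - margin, center + margin


# A user in New York reports a problem at 2:30 PM local time (sample values).
start, end = search_window_utc("2022-09-16 14:30", "America/New_York")

# The same window as seen by an engineer in India.
start_ist = start.astimezone(ZoneInfo("Asia/Kolkata"))
end_ist = end.astimezone(ZoneInfo("Asia/Kolkata"))
```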

8. Errors in a particular API: It is very easy to see how many errors happened in a particular API and what those errors are. Again, this was an amazing feature, because our Scrum Master and Product Owners can also go to Datadog and see what’s happening in the application technically.

Cons I felt:

1. LogRocket has more features in terms of session replay: you can see the request body and headers, the memory graph, etc. It gives you the feeling that you are inspecting the UI with the dev tools. I couldn’t find that in Datadog. I desperately wanted to see the request body and the memory graph on the client device to debug a performance issue.


The original article was published on Medium.
