How to improve your monitoring and alerting on Sentry

Quentin Scabello · 9 min read

In February of last year, I began my first project: improving a banking application. The project had a common code base used by about 10 teams, as well as external dependencies. My team and I developed a new subscription path for users. It was crucial for our client that the application function properly, so that the new subscription path would not launch with bugs.

One effective way to monitor and assess the health of our application is to use Sentry, an error-monitoring tool. When an error occurs, Sentry captures it and records information such as the type of error, when it occurred, and associated contextual data. This helps developers to quickly identify and fix errors in the application.

Upon arriving, the team and I were unfamiliar with Sentry, so we had to learn about it and get to grips with its functionality. In the following months, I focused on two main areas of improvement: grouping errors and monitoring them. In this article, I will present the various strategies and techniques I used to monitor and analyze errors, in order to ensure the stability and quality of the application.

Manual grouping to resolve common errors

When an error is sent to Sentry, a hash is calculated from the call stack, the error message, or other data attached to the event. If two errors are similar, they get the same hash and are grouped into a single issue, which avoids duplicates and keeps our error list from being flooded.

[Figure: how Sentry groups similar events into a single issue]

If an error has a custom message built from dynamic data (a DTO, an entity, a date, …), or if a transaction contains a path parameter, Sentry will not always group the two errors together. However, there are three possible methods to configure Sentry to do this:

We dedicated one day of our week to cleaning up our Sentry dashboard. We manually grouped errors using fingerprints, ignored irrelevant errors, and deleted errors that had already been resolved. This work allowed us to do a major cleanup of the application’s Sentry errors.
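Fingerprints can also be set directly from the SDK, which is useful when an error message contains dynamic data. The snippet below is a minimal sketch assuming the Sentry JavaScript SDK; the "Subscription timed out" check and the fingerprint value are made up for illustration, not taken from our project.

```typescript
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  beforeSend(event, hint) {
    const error = hint.originalException;
    // Messages such as "Subscription timed out for user 42" would otherwise
    // produce one issue per user ID; a static fingerprint groups them all
    // into a single issue.
    if (error instanceof Error && error.message.startsWith("Subscription timed out")) {
      event.fingerprint = ["subscription-timeout"];
    }
    return event;
  },
});
```

Every event that ends up with the same fingerprint is merged into one issue, so the duplicates disappear from the error list and from the alerts.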

However, this grouping was done manually at a specific point in time without any particular strategy for the given errors. As a result, it will not be effective for new routes or microservices, as it does not cover all possible errors. We have estimated the number of errors in the application, and our goal is now to reduce this number by solving the most frequent errors one at a time.

Daily monitoring to improve the status of a microservice

Initially, we aimed to briefly review each new issue originating from our microservice so that nothing was missed. Every day, we analyzed and prioritized new errors based on their severity, and we documented the steps taken to resolve them so that they would be easier to fix if they reappeared. We spent approximately 40 minutes per day on this analysis. But this system had several limitations:

These problems were frustrating for the team, which felt it was not making progress, so we decided to abandon this system.

Real-time alerting on Slack

To monitor our customers’ journeys on our microservices, we use the Alerts tab in Sentry. This tool links Sentry to Slack channels, allowing you to create alerts that send messages to specific channels. If a new error occurs, or if an existing error occurs multiple times, a message with the error details is sent to the Slack channel. Follow the Sentry documentation to learn how to create an alert with Slack integration.

For each of our microservices, we created two different alerting channels:

The unassigned channel receives all the errors of the microservice. Since several teams work on certain microservices, this channel does not filter errors, and new ones keep arriving there. Once an error has been analyzed, it is assigned to one of the teams working on the microservice, and any new occurrences of the issue are then directed to the assigned channel, which is dedicated to the errors assigned to that team.

We begin by noting the number of alerts we receive each day. This helps us identify the microservices that are causing the most errors, allowing us to focus our efforts on the most critical areas. Additionally, the Sentry alerting page provides a clear view of the various alerts during a given period. This allows us to easily identify the most frequent errors, as well as any peaks that may be indicative of production incidents.

[Figure: the Sentry alerting page]

Better grouping to remove noise

Using Slack in this way has proved quite effective at improving our reactivity to bugs, and we have already detected several bugs in production thanks to this alerting. However, the problem of error grouping still persists: several times, a badly grouped error has produced a lot of noise and many duplicate alerts. This problem requires more attention.

According to Sentry, there are three good ways to group:

In our case, grouping by server or geographical area does not make sense, as our application is mainly used in France. So the first two options remain.

Grouping by root cause is easy to set up since Sentry naturally groups errors with the same root cause. This is because the fingerprint is calculated from the error message and the stack trace. We can also create rules to group errors by transaction or error message similarity. Therefore, I began grouping errors by transaction and error message for easier management.

However, it is essential to have a comprehensive understanding of the entire application. Many transactions that appear dissimilar at first glance may actually point to the same root cause. To group issues together, rules can be added to error messages, but if the error message changes, the grouping may break.
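This kind of grouping can also be expressed as fingerprint rules in the project’s Issue Grouping settings. The rules below are only a sketch of what Sentry’s fingerprint rules look like: the error types and messages are invented for illustration, not the actual rules from our project.

```
# Group all connection errors into a single issue,
# regardless of the dynamic part of the message.
error.type:ConnectionError -> connection-error

# Group errors whose message only differs by a user ID.
error.value:"Subscription timed out for user *" -> subscription-timeout
```

Because these rules live in Sentry rather than in the code, they can be updated when an error message changes and breaks the grouping, without waiting for a new release.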

Sentry has a feature that suggests which people or team should be responsible for an error based on the commit that introduced it. This feature is very handy because it can save a lot of time. When working on a shared front end with other teams, this feature eliminates the need to assign errors between teams. Instead, we can focus on solving errors, without being bothered by errors that do not concern us.

However, there are two problems:

Increase responsiveness and efficiency by monitoring critical paths in a microservice

After further grouping by root cause, we discovered that one error was producing a lot of noise, making it difficult to see errors on crucial routes. To separate these noisy errors from the other, more critical errors, several methods can be used:
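For illustration only (not necessarily the methods we ended up choosing), one common option is to drop or tag the known noisy error in the SDK so that it never reaches the critical alert channels. Here is a minimal sketch, again assuming the Sentry JavaScript SDK, with made-up error messages:

```typescript
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  // Drop events matching a known, low-value error so they never fire an alert.
  ignoreErrors: [/Third-party widget failed to load/],
  beforeSend(event) {
    // Alternatively, keep the event but tag it as noise so that the alert
    // rules feeding the critical Slack channels can filter it out.
    if (event.message?.includes("retry scheduled")) {
      event.tags = { ...event.tags, noise: "true" };
    }
    return event;
  },
});
```

Sentry’s issue alert rules can then filter on that tag, so the noisy error can be routed to a dedicated low-priority channel instead of the ones we monitor closely.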

Conclusion

Over the past 12 months, two distinct systems have been tested and implemented to monitor and process new errors on Sentry:

There are still a few areas for improvement in this last method:
