How to safely deploy new features without breaking your production

Nathan Boulanger10 min read

Man pressing Rolling Deployment button over Regular Deployment one

Deploying new features in production can often be frightening. It’s difficult to test a new feature as thoroughly as desired in a staging environment, especially when your application has a large user base. There is always the fear of missing some tricky edge cases that only occur in production, resulting in a bad user experience.

I used to work on a project whose goal was to digitize administrative procedures. As part of our project workflow, users’ identity documents needed to be authenticated by calling an external API (let’s call it VerifyMeNow).

The problem was that this API wasn’t working very well and often blocked many of our users by returning server errors, requiring manual interventions on our side to investigate each problem and unblock each user. This resulted in a loss of trust with the API provider. Unfortunately, we had to continue using this provider due to business reasons, and switching to a different provider was not an option.

One day, the VerifyMeNow team came to tell us that they had just deployed a V2 of their API that was supposed to solve many of the issues we had encountered so far. But given that they had already made promises they hadn’t kept, we were afraid that migrating all of our traffic to V2 at once could cause major disruptions and blockages for our users.

Roman soldier from Asterix complaining about migration not going as expected

To ensure that the new version was functioning properly before fully migrating our traffic, we needed to test it beforehand. One option would have been to spend several weeks rigorously testing the new version (or, in an ideal world, to recruit a QA team to test every possible scenario). But no amount of testing time would have made us 100% confident in the migration: given our large number of users, plenty of edge cases could have been left untested and still happened in production.

However, another solution came to mind: why not start by migrating a small portion of our users to V2, monitor their experience, and gradually migrate more users over time if everything went smoothly?

Canary Release as a gradual deployment solution

Canary Release is a deployment strategy in which a new feature is enabled for a subset of users, to verify that it works properly before migrating all users to it. It also allows comparing analytics between the new and old feature (NPS, user abandonment rates, success rates on the new feature, user behavior).

Implementing this strategy involves repeatedly adjusting the proportion of users who access the new functionality: decreasing it if issues are encountered, increasing it otherwise. One of the advantages of this strategy is that there is generally no need to modify the code, and therefore no need to launch a new deployment.

A good practice to ensure this is to store all properties related to the Canary Release in the database, for example, so that this data is read at execution time and modifying it is very inexpensive. This gives a very flexible integration of new features, much like a feature flag, with two major benefits: the rollout proportion can be changed instantly without redeploying, and rolling back is as cheap as updating a single value.
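The "properties in the database" idea can be sketched as follows. This is a minimal illustration, not the article's actual implementation: the class name, the key name, and the in-memory `Map` standing in for the real database table are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of database-backed Canary Release settings.
// A ConcurrentHashMap stands in for the real persistence layer.
public class CanarySettings {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    public CanarySettings() {
        // Seed with a default: route 5% of users to the new version.
        store.put("verifyMeNow.v2.rate", "0.05");
    }

    // Read the rate at execution time, so a change to the stored value
    // takes effect immediately, without a new deployment.
    public double newVersionRate(String key) {
        return Double.parseDouble(store.getOrDefault(key, "0.0"));
    }

    // Adjusting the rollout (up or down) is a data change, not a code change.
    public void setRate(String key, double rate) {
        store.put(key, Double.toString(rate));
    }
}
```

Because the value is read on every execution rather than baked into the build, rolling back simply means writing a smaller rate.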

Example: Suppose you’re working for an online car retailer. You want to secure the checkout on cars: you already use mobile confirmation, but you want to start replacing it with VerifyMeNow to make it more secure. In this case, a Canary Release strategy would be a great fit. You could deploy the new version of the identity verification process to a small subset of users, monitor their behavior and the performance of the new feature, and quickly roll back if any issue arises. If the new feature performs well, you could gradually roll it out to more users until it is released to the entire user base.

Coming back to my problem, our typical use-case would be as follows: users connect to our application, fill out their administrative request, upload the requested documents, and submit it. The validation process is asynchronous, and users return several days later to see whether their application was accepted. During this asynchronous verification, the VerifyMeNow API is called.

Given this, releasing the feature to a defined subset of users wouldn’t be very useful, since the difference between the two versions is not visible to them; analyzing their behavior wouldn’t make much sense.

In conclusion, a Canary Release could have worked, but it might have been a bit over-engineered.

Canary Deployment, an alternative solution

The Canary Deployment strategy is very similar to Canary Release, the main difference being how it is applied. Unlike the previous strategy, it does not target a defined subgroup of users but simply a given proportion of the traffic (hence the term Traffic Splitting). This solution is often more appropriate and simpler to implement for back-end changes, or any change the user will not directly notice. In such cases, user feedback would not provide any additional information: internal monitoring combined with this strategy is enough to evaluate its effectiveness (performance, stability, and so on).

Example: Suppose you are the lead developer on a microservices-based architecture that consists of multiple APIs and services. Your team has developed a new version of one of the APIs that uses a different database schema. In this case, you can use a Canary Deployment strategy to test the new version on a random subset of traffic before fully releasing it to production. You can use traffic splitting to direct a small portion of traffic to the new version and monitor its performance and reliability. If the new version performs well, you can gradually increase the proportion of traffic directed to it until it is fully rolled out to all users.
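The traffic-splitting decision itself can be as simple as a weighted coin flip per request. The sketch below (all names are illustrative, not from any particular framework) shows a router whose rate can be raised at runtime as confidence in the new version grows:

```java
import java.util.Random;

// Minimal traffic-splitting sketch: each request goes to the new
// version with probability equal to the configured rate.
public class TrafficSplitter {
    private final Random random;
    private volatile double newVersionRate; // e.g. 0.05 = 5% of traffic

    public TrafficSplitter(double initialRate, Random random) {
        this.newVersionRate = initialRate;
        this.random = random;
    }

    // Called once per incoming request.
    public boolean routeToNewVersion() {
        return random.nextDouble() < newVersionRate;
    }

    // Gradually increase (or, on incident, decrease) the proportion.
    public void setNewVersionRate(double rate) {
        this.newVersionRate = rate;
    }
}
```

In practice this decision often lives in your load balancer or service mesh rather than in application code, but the principle is the same.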

This solution was more appropriate in my case: as explained just before, since the migration of the API involves no difference in the user experience, releasing it to a specific subset of users would add no value. Applying the migration to a given proportion of the traffic was enough.

The difference with other deployment methods

A/B Testing

The A/B Testing strategy is quite similar in principle: offering two different solutions for the same use-case to its users and comparing the effectiveness of the two solutions.

The difference lies more in the use cases: Canary Deployment will be used to test and validate a new feature on a progressive release, while A/B Testing is used to compare different versions of the same product to select the most optimal one.

Performance and Continuous Availability Solutions

In the end, we chose to implement a Canary Deployment strategy for our migration. Another relevant point: our project used a workflow engine, Camunda, to monitor the proper behavior of our clients’ applications throughout our workflow. So we decided to make this Canary Deployment visible in the workflow engine, to make its behavior easier to monitor.

Monitoring your release with a workflow engine

As mentioned earlier, the main strength of these release strategies is the flexibility to gradually release a new feature and monitor its effects.

This can become even more useful if you have a workflow engine plugged into your project, such as Camunda, AWS Step Functions, or any custom solution.

Why? Because it lets you visually follow your migration: you can monitor how many instances are currently running on each version and how many incidents are happening inside each of them, and quickly detect any potential issue.

If we come back to my example of migration from a V1 to a V2 of an API, implementing a Canary Deployment strategy and using a workflow engine (Camunda in my case), we could have a workflow like the following:

Camunda BPMN Diagram associated to the workflow

Our Choose which version to use brick would simply route the given proportion of traffic to V2. In pseudo-code it would look something like this (the actual code would depend on factors such as your workflow engine and where and how you stored your deployment rate):

// Retrieve the configured proportion of users that should use v2
double newVersionRate = retrieveCanaryDeploymentRate(...);

// Decide which version the current instance will be routed to
boolean useNewVersion = Math.random() < newVersionRate;

// Store the decision as a workflow variable (API depends on your engine)
setWorkflowVariable("useVersion", useNewVersion ? 2 : 1);

This would allow you to easily monitor, visually, how many instances are currently running on each version and how many incidents are happening inside each of them, and even to add specific error handlers if needed.
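To make the pseudo-code above concrete, here is a self-contained sketch of the routing brick. The `WorkflowContext` class is a stand-in for your engine’s API (in Camunda, this logic would live in a `JavaDelegate` writing to a `DelegateExecution`); only the `"useVersion"` variable name comes from the pseudo-code, everything else is illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Fleshed-out version of the "Choose which version to use" brick.
public class VersionRouter {
    // Stand-in for the engine-specific variable store
    // (e.g. Camunda's DelegateExecution).
    public static class WorkflowContext {
        public final Map<String, Object> variables = new HashMap<>();
        public void setVariable(String name, Object value) {
            variables.put(name, value);
        }
    }

    private final Random random;

    public VersionRouter(Random random) {
        this.random = random;
    }

    // Roll the dice against the stored deployment rate and record the
    // chosen version on the instance, so later tasks can branch on it.
    public void chooseVersion(WorkflowContext context, double newVersionRate) {
        boolean useNewVersion = random.nextDouble() < newVersionRate;
        context.setVariable("useVersion", useNewVersion ? 2 : 1);
    }
}
```

Every downstream gateway in the BPMN diagram then only needs to read the `useVersion` variable to route the instance to the V1 or V2 branch.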


Deploying new features in production can be risky, especially when working with third-party APIs that may not function as expected. The Canary Release and Canary Deployment strategies mitigate this risk by allowing a new feature to be tested on a subset of users (or traffic) before migrating everyone to it. This allows better monitoring and evaluation of the new feature’s performance, and enables quick rollbacks should any issues arise. It also reduces the need for extensive pre-deployment testing of your application.

While this solution may be considered over-engineered in some cases, the investment required for implementing these strategies is not that significant, as the process itself is quite straightforward.

In my case, we used Canary Deployment coupled with a workflow engine to gradually migrate all our traffic to V2 of the VerifyMeNow API. This allowed us to safely monitor its progress.

In the end, everything went smoothly, so we probably could have migrated everything all at once. However, using this approach made us much more confident about the process and saved us time on tests.

Adding these guards was almost negligible in terms of cost, as it did not require much extra code. Therefore, I would definitely use this approach again, even though we did not encounter any problems.
