Optimizing for high availability in time of crisis

April 09, 2020 · Nathan Gaberel · 7 min read

These past few weeks have been tough for many websites. Some of them are facing an unprecedented increase in traffic (e.g. video conference software, online grocery stores, social media) and others a dangerous drop in users (e.g. car insurance, travel agencies).

Although both of those scenarios present severe risks for businesses, the second one often cannot be solved by technology alone. So the goal of this article is to help with the former and provide a set of quick actions to maintain a highly available website on AWS under unexpectedly high traffic.

Most of the time in this situation, getting our website back online as soon as possible will be our top priority, which doesn't leave much time for writing or changing code (and would we want to add untested code into the situation anyway?). It also means that we're prepared to spend more on our infrastructure to accommodate the increased traffic (which may well pay for itself if that helps generate more business). For each recommendation I'll try to give a sense of the price increase involved.

Some of the actions below are pretty generic (they only assume a standard 2- or 3-tier web infrastructure) and some are more context-dependent. We'll cover the generic actions first and then explain in which context the specific actions apply.

Generic actions

1. Increase instance size

Time: < 1h
Money: from ×2 to ×100 on EC2 bill (you choose)
Effectiveness: increased capacity for CPU- or network-bound websites

This is the simplest and quickest action, although it can be a little bit expensive depending on the new instance type we choose.

Keep in mind that changing instance size means downtime for each instance: an EBS-backed instance must be stopped before its type can be changed, and in other setups the running EC2 instances have to be replaced with new ones, so this migration requires some planning. In particular, if our application is stateful, we risk losing data, so our best option is probably to take snapshots of the EBS volumes first and, if we replace instances, recreate the volumes for the new ones. In most cases, however, all that's needed is a redeployment of the app on the new instance(s).
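As a rough illustration of the snapshot step, here is a minimal sketch. It assumes we pass in an object that behaves like `boto3.client("ec2")`; the function name `snapshot_instance_volumes` and the description text are made up for this example, and pagination of large volume lists is ignored.

```python
from typing import Any, List


def snapshot_instance_volumes(ec2: Any, instance_id: str) -> List[str]:
    """Snapshot every EBS volume attached to one instance; return snapshot IDs.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    # Find the volumes currently attached to the instance.
    response = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    snapshot_ids = []
    for volume in response["Volumes"]:
        # Request one snapshot per volume; AWS completes it asynchronously.
        snapshot = ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"Pre-resize backup of {volume['VolumeId']} ({instance_id})",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

Snapshots are created asynchronously, so in practice we would wait for them to complete before touching the instance.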

To save on EC2 costs, remember to use reserved instances when possible.

2. Use autoscaling

Time: 2h - 6h
Money: from ×0.1 to ×10 on EC2 bill depending on autoscaling rules
Effectiveness: lifts all CPU- or network-bound blockers for stateless apps

This is a very good option if our app is already stateless.

Autoscaling will automate the creation and termination of EC2 instances to match the fleet size to the current traffic. This will allow us to not over-provision instances to serve our users and save a lot of money. Remember that we can set a max number of instances in our fleet so we can stay in control of billing.
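To get a feel for how a scaling rule and the fleet limits interact, here is a small sketch of the arithmetic behind a target-tracking style rule. It is a simplification for illustration, not AWS's exact algorithm, and the function and parameter names are assumptions.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Estimate how many instances keep `metric` (e.g. average CPU %) near `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    # Scale the fleet proportionally to how far the metric is from its target.
    wanted = math.ceil(current * metric / target)
    # Clamp to the configured bounds; max_size is what keeps the bill under control.
    return max(min_size, min(max_size, wanted))
```

For example, 4 instances at 90% CPU with a 50% target would call for 8 instances, unless the maximum fleet size is lower.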

If our app isn't stateless already, we may want to consider focusing development efforts on making it so in the short term. Although it's likely to require substantial changes to the code, it's also going to unlock big performance improvements through parallelization (and autoscaling), so it might be worth the trade-off. If however we decide that making our application stateless isn't for us, there are still things we can do to unlock performance, especially action 6.

3. Deploy to multiple regions (active/active sites)

Time: 0.5 day (with IAC, prepared) to 3 days (without IAC, unprepared)
Money: ×2 for the entire AWS bill
Effectiveness: doubled capacity

Another way to keep response times low worldwide and to massively increase our website's resilience is to deploy it in multiple regions, both serving traffic (also called active/active deployment). By this point we should already be using multiple availability zones in the first region and, by deploying a copy of the site to a second region, we will get:

  • the capacity to serve double the normal number of users;
  • a site that's resilient against region-wide disasters.

Infrastructure as code (or IAC, with e.g. Terraform or CloudFormation) will make this deployment much easier and faster. If we don't use such a solution already, I wouldn't recommend setting one up just now, as it usually takes longer to prepare an IAC script than to provision an environment manually through the console. However we should definitely add this to our backlog for when things start to calm down.

Once we've got our second region running, we'll need to split the traffic between the sites. This can be done with Route 53, which has several routing policy types to handle this, including latency-based and geolocation routing.
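As an illustration of latency-based routing, the sketch below builds the change batch for one record per region. The helper name `latency_change_batch`, the record name and the load balancer hostnames are hypothetical; the dictionary follows the shape expected by the `ChangeBatch` parameter of `change_resource_record_sets` in boto3's Route 53 client.

```python
from typing import Dict, List


def latency_change_batch(record_name: str, targets: Dict[str, str], ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch with one latency-based CNAME per region.

    `targets` maps an AWS region (e.g. "eu-west-1") to that site's hostname.
    """
    changes: List[dict] = []
    for region, dns_name in sorted(targets.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": f"{region}-site",  # must be unique per record
                "Region": region,                   # marks the record as latency-based
                "TTL": ttl,
                "ResourceRecords": [{"Value": dns_name}],
            },
        })
    return {"Comment": "Latency-based routing across regions", "Changes": changes}
```

The result would be passed, together with the hosted zone ID, to `change_resource_record_sets`. A short TTL keeps the switch-over fast if one region has to be taken out.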

Finally we probably need to use a "single" database for both regions. One way to achieve this is to create a read replica of the existing RDS database in the second region (with async replication). Point all database access to the master database in the first region; in case of disaster in that region, we can promote the replica to a standalone database and switch the application over to its endpoint.

Specific actions

4. Cache static assets

Time: 0.5 to 1 day
Money: depends on asset sizes and volume downloaded, see calculator
Effectiveness: removes static files load on web servers

This is useful if web servers take time to serve each and every request for static assets (e.g. images, CSS and JS scripts, static html) or, even worse, re-generate them each time. In that case a great way to reduce the load on the servers is to have AWS serve the static assets for us.

A standard solution to this problem is to store assets in S3 and serve them through AWS's CDN, CloudFront. Using CloudFront enables caching of resources and uses edge locations around the world to ensure fast responses everywhere. However, it becomes expensive with many users. Use AWS's simple monthly calculator (the newer AWS pricing calculator doesn't support CloudFront yet) to estimate how much it would cost.
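How long CloudFront and browsers keep an asset depends largely on the `Cache-Control` header stored with the object. Here is a minimal sketch of how we might pick that header per file; the fingerprint naming convention and the durations are assumptions to adapt to our own build setup.

```python
import re

# Assumption: the build emits fingerprinted names such as "app.3f9a1c2b.js".
FINGERPRINT = re.compile(r"\.[0-9a-f]{8,}\.(?:js|css|png|jpg|svg|woff2)$")


def cache_control_for(key: str) -> str:
    """Choose a Cache-Control value for an asset stored under `key`."""
    if FINGERPRINT.search(key):
        # Content never changes under this name, so cache it for a year.
        return "public, max-age=31536000, immutable"
    if key.endswith(".html"):
        # HTML points at the fingerprinted files, so keep it fresh.
        return "public, max-age=60"
    # Everything else: a moderate default of one hour.
    return "public, max-age=3600"
```

The returned value can be set when uploading to S3, for instance through the `CacheControl` argument of boto3's `put_object`.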

5. Cache long-running database queries and computed values

Time: 1h to 1 day depending on web framework
Money: 15 USD/month (with instance reservation) to 20k+ USD/month depending on requirements
Effectiveness: removes expensive computations from web servers

It's an almost unavoidable issue on dynamic websites with growing data: unoptimized database queries start taking longer, eventually blocking worker processes for several seconds at a time and seriously limiting our ability to scale the number of concurrent users. While it's important to optimize database queries as much as possible, there's a simple way around this issue: caching query results.

A typical way to implement application-level caching on AWS is ElastiCache. We'll pick between a Redis and a Memcached backend based on what's easiest to implement in our application code and use the cached value as often as is reasonable.

The same reasoning (and solution) applies to computed values that are not user-dependent and don't need to be recomputed on each request.
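The caching logic itself is usually small. Below is a minimal sketch of the "return the cached value if it's fresh, otherwise compute and store it" pattern, using an in-memory dictionary as a stand-in for a Redis or Memcached client. The class and method names are made up for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis or Memcached client."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Maps a cache key to (expiry time, cached value).
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, ttl: float, compute: Callable[[], Any]) -> Any:
        entry = self._data.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the expensive call
        value = compute()    # cache miss or expired entry: run the query
        self._data[key] = (now + ttl, value)
        return value
```

A call such as `cache.get_or_compute("top-products", 60, run_slow_query)` (with `run_slow_query` being our own function) would then hit the database at most once a minute per process. With ElastiCache the store is shared, so the saving applies across all web servers.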

6. Store session data properly

Time: 1h to 1 day depending on web framework
Money: 10 USD/month (with instance reservation) to 1k+ USD/month depending on number of users
Effectiveness: removes lock acquiring bottleneck

If our application has to remain stateful and we're storing session information on the servers, we should make sure the session store allows concurrent access.

For example, in some implementations (especially in PHP), storing sessions in a single file can lead worker processes to waste a long time acquiring the session file lock on every request. An alternative is to store session data in memory, for example using ElastiCache on AWS (see action 5).
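To make the idea concrete, here is a minimal sketch of a session store that keeps each session under its own key in a Redis-like backend. It assumes the backend exposes `get`, `setex` and `delete` the way a redis-py client does; the class name, key prefix and 30-minute expiry are arbitrary choices for illustration.

```python
import json
from typing import Any, Dict


class SessionStore:
    """Stores each session under its own key instead of a locked file."""

    def __init__(self, backend: Any, ttl_seconds: int = 1800) -> None:
        # `backend` is assumed to expose get/setex/delete like a redis-py client.
        self._backend = backend
        self._ttl = ttl_seconds

    def _key(self, session_id: str) -> str:
        return f"session:{session_id}"

    def load(self, session_id: str) -> Dict[str, Any]:
        raw = self._backend.get(self._key(session_id))
        # A missing or expired session comes back as an empty dict.
        return json.loads(raw) if raw else {}

    def save(self, session_id: str, data: Dict[str, Any]) -> None:
        # setex writes the value and refreshes its expiry in one call.
        self._backend.setex(self._key(session_id), self._ttl, json.dumps(data))

    def destroy(self, session_id: str) -> None:
        self._backend.delete(self._key(session_id))
```

Most web frameworks ship or support a session backend along these lines, so in practice this is often a configuration change rather than new code.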

7. Rewrite bottleneck services

Time: depends on service
Money: the cost of the development time involved
Effectiveness: removes application-level bottlenecks

This one is a last resort, for when no other options are obviously available. Once we've identified the subsystem(s) responsible for the performance bottleneck(s) (for example using AWS X-Ray, New Relic or plain logs), it's time to consider whether a rewrite would yield significant improvements.
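If plain logs are all we have, a few lines of scripting can already point at the slowest endpoints. The sketch below computes the 95th percentile latency per endpoint; the log format it parses (`<METHOD> <path> <status> <latency_ms>`) is an assumption for the example and would need adapting to our real access logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def p95_by_endpoint(lines: Iterable[str]) -> Dict[str, float]:
    """Return the 95th percentile latency (ms) per endpoint.

    Assumes a hypothetical log format: "<METHOD> <path> <status> <latency_ms>".
    """
    samples: Dict[str, List[float]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that don't match the assumed format
        method, path, _status, latency = parts
        try:
            latency_ms = float(latency)
        except ValueError:
            continue  # skip lines with a non-numeric latency
        samples[f"{method} {path}"].append(latency_ms)

    result: Dict[str, float] = {}
    for endpoint, values in samples.items():
        values.sort()
        # Nearest-rank percentile, using integer math for ceil(0.95 * n).
        rank = (95 * len(values) + 99) // 100
        result[endpoint] = values[rank - 1]
    return result
```

Sorting the result by value, for instance with `sorted(result.items(), key=lambda kv: kv[1], reverse=True)`, gives a first shortlist of candidates for a rewrite.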

In this sort of situation it's beneficial to rely on cloud services to build quickly and cleanly. AWS services like SQS (for decoupling), AWS Chatbot, Elastic Transcoder, Lambda, Amazon Polly and Kinesis, for example, are reliable building blocks for solid, serverless workflows such as chatbots, video transcoding, speech synthesis or event streaming. And there are many similar services we can use to build extremely scalable applications.


Need more help? I'd like to help you with your scalability issues! Contact me by email or on twitter and we'll try to find a solution that's right for you. 🙂
Nathan Gaberel

Architect developer