Rapidly recover from application failures in a single AZ | Networking & Content Delivery

https://aws.amazon.com/blogs/networking-and-content-delivery/rapidly-recover-from-application-failures-in-a-single-az/

Today we would like to introduce “zonal shift”, a new capability of Amazon Route 53 Application Recovery Controller (ARC) that is built into Amazon Elastic Load Balancers (Amazon ELBs) and available in preview. Performing a zonal shift enables you to achieve rapid recovery from application failures within a single Availability Zone (AZ).

In this post, we’ll explain how zonal shifts work and how they fit into an overall reliability strategy for a highly resilient multi-AZ application that uses features such as load balancer health checks. You can start using zonal shifts in preview today within the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Asia Pacific (Tokyo), Asia Pacific (Sydney), and Asia Pacific (Jakarta) Regions. You can use zonal shift with Application Load Balancers (ALBs) and Network Load Balancers (NLBs) that have cross-zone load balancing turned off. There’s no additional charge for using zonal shift with ALB and NLB.

Building fault tolerant services using AZs

A key strategy for designing reliable systems, adopted by AWS services and customers operating highly resilient applications on AWS, is to build the overall system as multiple (commonly three) independent application “replicas” and plan for the failure of any one replica at a time. You must then provision sufficient capacity in each replica to handle the load should one replica be offline temporarily. You next work to ensure all common failure modes (e.g., bad deployment, response latency too high, elevated error rates) are contained within one replica (one “fault container”). Then, should a replica fail, you can temporarily remove it from the system to restore normal service for your customers. Once normal service is restored, you can investigate and repair the failing replica. Failures can come from various sources, including software deployments, operator errors, hardware failures, power failures, network device failures, certificate expiry, and even data center failures. By working to make sure that failures are rare, contained to one replica, and quick to recover from, we’ve found we can operate more reliable systems.
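The capacity requirement above comes down to simple arithmetic. The following is a minimal sketch, assuming load spreads evenly across the surviving replicas; the function names and the example load figure are illustrative and not from any AWS tool.

```python
def per_replica_capacity(peak_load: float, replicas: int, tolerated_failures: int = 1) -> float:
    """Capacity each replica needs so the survivors can carry the full peak load."""
    survivors = replicas - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving replica")
    # The survivors must absorb the whole peak between them.
    return peak_load / survivors


def overprovision_factor(replicas: int, tolerated_failures: int = 1) -> float:
    """Total provisioned capacity divided by peak load."""
    survivors = replicas - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving replica")
    return replicas / survivors


# Hypothetical example: 3,000 requests/second at peak across three zonal replicas.
print(per_replica_capacity(3000, 3))   # each replica sized for 1500.0 requests/second
print(overprovision_factor(3))         # 1.5x the peak load in total
```

With three replicas, tolerating the loss of one means provisioning 50% more than peak load in total; with only two replicas the same guarantee requires 100% more, which is one reason three replicas is a common choice.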

This strategy is “recovery-oriented,” meaning that it prioritizes recovery first over investigation and repair. You first recover the application to a healthy state by removing the failing replica. Then, you can investigate the root cause and repair the failing replica, before returning it to service. Making sure that you can recover first before determining root cause reduces your Mean Time to Recovery (MTTR), thereby lowering the duration of impact on customers.

A critical input to this strategy is to minimize the chance that any two replicas fail at the same time or in a coordinated fashion. To do this, you must make sure that your replicas operate as independently as possible. This typically involves a series of measures, such as deploying software to only one replica at a time, making changes to only one replica at a time, and diversifying (or “jittering”) limits across replicas, such as filesystem sizes, heap memory limits, certificate expiry times, the time a scheduled job runs, etc. Jittering limits between replicas can help contain the initial occurrence of limit-related issues to one replica (which you can afford to remove temporarily), so they do not arise in multiple replicas simultaneously.
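One simple way to jitter a limit is to stagger it by a fixed step per replica, so that no two replicas reach the limit at the same moment. The sketch below is only an illustration: the function name, the AZ IDs, and the 90-day and 7-day figures are assumptions, not values from this post.

```python
from typing import Dict, Sequence


def staggered_limits(base: int, step: int, replicas: Sequence[str]) -> Dict[str, int]:
    """Give each replica a different limit: base, base + step, base + 2 * step, ..."""
    return {name: base + index * step for index, name in enumerate(replicas)}


# Hypothetical example: certificate lifetimes in days for three zonal replicas.
lifetimes = staggered_limits(base=90, step=7, replicas=["use1-az1", "use1-az2", "use1-az3"])
print(lifetimes)
```

Here the first replica's certificate expires a week before the next one, so an expiry problem shows up in a single replica while the others still have time left. The same pattern could apply to heap limits, filesystem sizes, or scheduled job start times.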

Systems benefit from this replica strategy even more when the replicas align with independent physical fault containers. When building on AWS, you use AZs as the physical fault containers. AZs let us place our replicas in distinct physical data centers at meaningful distances (typically miles) to make sure that they have diverse power, connectivity, network devices, flood plains, etc. Again, the aim is to minimize the number of events that any two replicas experience simultaneously and to prevent correlated failure.

Recovering from hard failures

Having built the application as multiple independent replicas, aligned with Availability Zones and provisioned with sufficient capacity to handle the loss of one replica, the next step is to make sure that you have mechanisms to rapidly detect and remove an unhealthy replica in an AZ (zonal replica). For those using ALB and NLB, the first line of defense against failures in an AZ is health checks. The load balancers probe each target at regular intervals to check for a healthy response (e.g., HTTP status 200). If there is an unhealthy response or timeout, then a failure is detected and requests are routed away from the failing target, typically in under a minute. Similarly, each load balancer node is health checked by Amazon Route 53 Health Checks, which will remove an AZ from the load balancer’s DNS if its targets are all unhealthy.

These target health checks are quick and effective in the face of hard or clearly detectable failures, such as a failed target instance, an application which is no longer listening for connections, or one which is returning HTTP status 500 in response to health checks. To make sure that they are most effective, it’s often helpful to design a “deep” health check handler which will test the application more thoroughly. However, deep health checks require careful thought to avoid false failures, such as in the event of overload. For more information, see the excellent Amazon Builders’ Library article on Implementing Health Checks.
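As a rough illustration of a “deep” health check, the sketch below runs a set of named checks and maps the outcome to an HTTP status code that a load balancer health check could evaluate. The function name, the check names, and the choice of 503 for an unhealthy result are assumptions made for this example; a real handler would sit behind your web framework's health check route.

```python
from typing import Callable, Dict, Mapping, Tuple


def deep_health_status(checks: Mapping[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every check and return (HTTP status, per-check results).

    A check that raises is treated as failed, so a crash inside a check
    cannot make the handler itself error out.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical checks for illustration only.
status, detail = deep_health_status({
    "config_loaded": lambda: True,
    "disk_writable": lambda: True,
})
print(status, detail)
```

To limit false failures, it may be safer to keep each check cheap and focused on the health of the local replica, so that a busy but functioning target is not marked unhealthy.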

In addition, ALB and NLB have recently added a new feature that lets you specify the minimum number of healthy targets in a given target group or AZ. This is helpful to make sure that, should one zonal replica experience failure and fall below a minimum configured capacity threshold, it will fail health checks and traffic will be routed to other replicas, thereby preventing the impaired replica from potentially being overwhelmed.
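The minimum healthy target threshold is set as a target group attribute. The sketch below assumes the boto3 `elbv2` client and the attribute key `target_group_health.dns_failover.minimum_healthy_targets.count` as recalled from the Elastic Load Balancing API; verify the key and its allowed values against the current documentation before relying on it. The function names and the threshold are illustrative.

```python
from typing import Any, Dict, List

# Attribute key as recalled from the ELBv2 API; treat it as an assumption and verify.
DNS_FAILOVER_MIN_COUNT = "target_group_health.dns_failover.minimum_healthy_targets.count"


def minimum_healthy_attributes(min_healthy_count: int) -> List[Dict[str, str]]:
    """Build the attribute list that marks a zone unhealthy below this many healthy targets."""
    if min_healthy_count < 1:
        raise ValueError("min_healthy_count must be at least 1")
    # Attribute values are passed as strings.
    return [{"Key": DNS_FAILOVER_MIN_COUNT, "Value": str(min_healthy_count)}]


def apply_minimum_healthy_targets(elbv2_client: Any, target_group_arn: str, min_healthy_count: int) -> Any:
    """Apply the threshold using a client created with boto3.client("elbv2")."""
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=minimum_healthy_attributes(min_healthy_count),
    )
```

For example, if each zonal replica normally runs six targets and needs at least four to carry its share of traffic, a threshold of 4 would take the zone out of DNS once it drops to three healthy targets.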

Recovering from gray failures

Even with deep health checks in place, more ambiguous, intermittent or “gray” failure modes can exist which are challenging to detect. For example, in the aftermath of a zonal deployment, a replica may respond “healthy” to probes, but have some functional bug which is impacting customers. Or perhaps the new code is performing less efficiently or crashing intermittently but still responding enough to appear healthy to the checkers. Subtle infrastructure issues such as packet loss or intermittent dependency failures can also result in slower responses, which still pass health checks.

For these gray failure situations, it’s helpful to have a higher-level mechanism (either human or automated) which examines the customer experience across the zonal replicas and, where one zonal replica is experiencing a gray failure, shifts traffic away from it. AWS has used this two-pronged strategy for many years, and we are now making it easier for customers to adopt a similar strategy when running applications on AWS.
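An automated version of such a mechanism could compare a customer-facing metric, such as error rate, across the zonal replicas and flag a zone that stands out from the rest. The following is a minimal sketch of that comparison; the function name, the thresholds, and the AZ IDs are assumptions chosen for illustration, and a production detector would likely need smoothing over time and more than one metric.

```python
from statistics import median
from typing import Mapping, Optional


def find_impaired_zone(
    error_rates: Mapping[str, float],
    min_rate: float = 0.02,
    ratio: float = 3.0,
) -> Optional[str]:
    """Return the zone whose error rate is a clear outlier, or None.

    A zone is flagged only if its error rate is at least `min_rate` and at
    least `ratio` times the median error rate of the other zones. If every
    zone is equally bad, nothing is flagged, because shifting away from one
    zone would not help.
    """
    if len(error_rates) < 2:
        return None
    worst = max(error_rates, key=lambda zone: error_rates[zone])
    baseline = median(rate for zone, rate in error_rates.items() if zone != worst)
    if error_rates[worst] >= min_rate and error_rates[worst] >= ratio * baseline:
        return worst
    return None


# Hypothetical per-zone error rates (fraction of failed requests).
print(find_impaired_zone({"use1-az1": 0.001, "use1-az2": 0.08, "use1-az3": 0.002}))
```

The comparison is relative on purpose: an elevated error rate in all zones usually points to a problem that is not zonal, which a zonal shift would not fix.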

First, we turn off cross-zone load balancing. As depicted in the following figure, this change reconfigures each zonal load balancer node to route requests to targets in the local AZ only. This change aligns the zonal fault containers between the load balancer and the targets and enables easier detection of failures within a single zonal replica. This alignment capability is now available with both ALB and NLB.
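Cross-zone load balancing can be turned off through load balancer or target group attributes. The sketch below assumes the boto3 `elbv2` client and the attribute key `load_balancing.cross_zone.enabled`, which to the best of our knowledge is set on the load balancer for NLB and on the target group for ALB; confirm this against the Elastic Load Balancing documentation for your load balancer type. The function names are illustrative.

```python
from typing import Any, Dict, List

# Attribute key as recalled from the ELBv2 API; verify before use.
CROSS_ZONE_KEY = "load_balancing.cross_zone.enabled"


def cross_zone_off_attributes() -> List[Dict[str, str]]:
    """Attribute list that turns cross-zone load balancing off."""
    return [{"Key": CROSS_ZONE_KEY, "Value": "false"}]


def disable_cross_zone_for_nlb(elbv2_client: Any, load_balancer_arn: str) -> Any:
    """Turn cross-zone off at the load balancer level (assumed to apply to NLB)."""
    return elbv2_client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=cross_zone_off_attributes(),
    )


def disable_cross_zone_for_target_group(elbv2_client: Any, target_group_arn: str) -> Any:
    """Turn cross-zone off at the target group level (assumed to apply to ALB)."""
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=cross_zone_off_attributes(),
    )
```

Both functions expect a client created with `boto3.client("elbv2")`. Because each zone then serves only its local targets, it is worth confirming that every zone has enough healthy targets before making the change.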

Figure 1. Illustration of how requests are routed with cross-zone load balancing on and off.

Second, both ALB and NLB (with cross-zone load balancing turned off) now include “zonal shifts,” built-in recovery controls which let you temporarily shift work away from one AZ, should your application be unhealthy there. As these controls are built-in, no setup is required, though you may need to make sure that your AWS Identity and Access Management (IAM) user/role has permissions to call the zonal shift APIs. Zonal shifts let you temporarily “shift” your customer traffic away from one unhealthy zonal replica via a simple “start-zonal-shift” API. If the remaining replicas are healthy and have capacity to serve customers, then you can restore the customer experience within minutes. Then, with your customers happily back to normal, you can work to debug and repair the unhealthy zonal replica. Once you’re ready to return the workload to the zone that is repaired, you can either cancel the zonal shift or simply let it expire.
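The start, cancel, and expire flow described above can be sketched with the AWS SDK for Python. This example assumes the boto3 `arc-zonal-shift` client and its `start_zonal_shift` and `cancel_zonal_shift` operations; the helper names, the default expiry, and the comment text are illustrative, and the AZ identifier format accepted for `awayFrom` should be checked against the current API reference.

```python
from typing import Any, Dict


def build_start_shift_request(
    resource_arn: str,
    away_from_az: str,
    expires_in: str = "2h",
    comment: str = "Shifting away from impaired zonal replica",
) -> Dict[str, str]:
    """Build the keyword arguments for start_zonal_shift."""
    return {
        "resourceIdentifier": resource_arn,  # ARN of the ALB or NLB
        "awayFrom": away_from_az,            # the AZ to move traffic away from
        "expiresIn": expires_in,             # duration string such as "30m" or "2h"
        "comment": comment,
    }


def start_shift(client: Any, resource_arn: str, away_from_az: str, expires_in: str = "2h") -> str:
    """Start a zonal shift and return its ID, which is needed to cancel it later."""
    response = client.start_zonal_shift(
        **build_start_shift_request(resource_arn, away_from_az, expires_in)
    )
    return response["zonalShiftId"]


def cancel_shift(client: Any, zonal_shift_id: str) -> Any:
    """End a zonal shift early, once the zonal replica is repaired."""
    return client.cancel_zonal_shift(zonalShiftId=zonal_shift_id)
```

The `client` argument would be created with `boto3.client("arc-zonal-shift")`. Setting an expiry on every shift means traffic returns to the zone automatically if nobody cancels the shift, so the expiry should be long enough to cover the expected repair time.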