Ed Huang

https://www.usenix.org/system/files/atc22-elhemali.pdf

It's been a long time since DynamoDB published a paper. I read this one a few days ago, and I think it is one of the most practical papers on large-scale distributed systems from an engineering perspective in recent years. It doesn't even explain much about the architecture (because it doesn't need to: Shared Nothing systems all look similar, and the authors know exactly who the readers of this paper are ^_^), and the paper is written in plain words, not fancy at all (and no math!). After all, DynamoDB doesn't need to 'prove' anything; it has fully proven itself over the past 10 years, both in scale and in stability. The paper is relatively new and there aren't many blogs talking about it yet, so why not write something about it?

Predictable Performance > High Performance

This is the first point I put in this post, and it is also the point that resonates most deeply with me from building TiDB's cloud service in recent years. I keep making this statement on different occasions: stable and slow beats unstable and fast, and p99 latency reflects system design skill far better than average latency does.

I don't know if this is intentional, but it is also the first thing presented in the DynamoDB 2022 paper, so you can see how important it is.

DynamoDB wants to provide predictable performance, and the first step is to abstract the workload, which is where the concepts of RCU and WCU come in. RCU and WCU are very close to QPS in the traditional sense, except that the size of the target item is factored in; for example, 1 WCU = 1 write per second for an item up to 1 KB. This lets you do relatively accurate workload planning. Once the user can describe the workload in terms of RCU and WCU, the first step toward predictability is complete, and DynamoDB's scheduler can do a lot of things like pre-partitioning and pre-allocating resources, because the hardware capability of each machine type can simply be abstracted into a combination of WCUs and RCUs.
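
To make this concrete, here is a minimal sketch of estimating RCU/WCU for a workload. The 1 KB write unit and 4 KB strongly consistent read unit match DynamoDB's public definitions; the function names, rounding, and the example numbers are just my illustration:

```go
package main

import (
	"fmt"
	"math"
)

// wcuForWrite estimates the WCU cost of a single write, assuming the
// public DynamoDB sizing rule: 1 WCU covers a write of an item up to 1 KB.
func wcuForWrite(itemBytes int) float64 {
	return math.Ceil(float64(itemBytes) / 1024.0)
}

// rcuForRead estimates the RCU cost of a strongly consistent read,
// assuming 1 RCU covers a read of an item up to 4 KB.
func rcuForRead(itemBytes int) float64 {
	return math.Ceil(float64(itemBytes) / 4096.0)
}

func main() {
	// A workload of 500 writes/s of 2.5 KB items and 1000 reads/s of 6 KB items.
	writesPerSec, readsPerSec := 500.0, 1000.0
	fmt.Printf("provisioned WCU: %.0f\n", writesPerSec*wcuForWrite(2560)) // 500 * 3 = 1500
	fmt.Printf("provisioned RCU: %.0f\n", readsPerSec*rcuForRead(6144))   // 1000 * 2 = 2000
}
```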

Once you know the quota of each partition, scheduling becomes roughly a knapsack problem: when allocating partitions, DynamoDB makes sure that the sum of the quotas of the partitions on the same machine stays below the total RCU/WCU the machine can provide, leaving roughly 20%-30% of headroom judging from the example given in the paper. In the real world, inexperienced developers or system administrators usually squeeze the last bit of CPU and IO out of a machine in pursuit of 'ultimate' performance and aren't satisfied until they see 100% CPU usage. But at that point the machine is already in a very unhealthy state: the long-tail latency of requests becomes very high, and even if throughput increases, this unstable long tail makes the performance observed from the user's perspective 'unpredictable'. In production environments we usually recommend over-provisioning machines by about 30% for the same reason.
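
A toy version of that placement check, with made-up node capacities and the ~30% headroom treated as a hard rule (the real scheduler obviously considers far more than this):

```go
package main

import "fmt"

// partition and storageNode are simplified stand-ins for the scheduler's view;
// the 30% headroom is the over-provisioning rule of thumb discussed above,
// not a number taken from DynamoDB's implementation.
type partition struct{ rcu, wcu float64 }

type storageNode struct {
	maxRCU, maxWCU float64
	partitions     []partition
}

const headroom = 0.30 // keep ~30% of machine capacity unallocated

// canPlace checks whether adding p keeps the node below (1 - headroom) of its capacity.
func (n *storageNode) canPlace(p partition) bool {
	var rcu, wcu float64
	for _, q := range n.partitions {
		rcu += q.rcu
		wcu += q.wcu
	}
	return rcu+p.rcu <= n.maxRCU*(1-headroom) &&
		wcu+p.wcu <= n.maxWCU*(1-headroom)
}

func main() {
	node := storageNode{maxRCU: 10000, maxWCU: 3000}
	node.partitions = append(node.partitions, partition{rcu: 4000, wcu: 1000})
	fmt.Println(node.canPlace(partition{rcu: 2000, wcu: 800}))  // true: fits under the headroom line
	fmt.Println(node.canPlace(partition{rcu: 4000, wcu: 1500})) // false: would eat into the reserve
}
```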

Burst is a simple idea: when allocating quota, DynamoDB reserves some capacity for each partition, and when traffic spikes in the short term, the reserved capacity is used. Adaptive Capacity dynamically adjusts the quotas of different partitions when the user's workload becomes skewed (but the total cannot exceed the table's overall quota).
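
A natural way to implement burst is a token bucket per partition: unused provisioned capacity is banked up to some cap and spent during short spikes. The sketch below is my own simplification; the field names and the cap are assumptions, not DynamoDB's internals:

```go
package main

import "fmt"

// burstBucket is a toy token-bucket sketch of per-partition admission:
// unused provisioned capacity accrues (up to a cap) and can absorb short spikes.
type burstBucket struct {
	provisioned float64 // WCU/RCU per second allocated to this partition
	tokens      float64 // currently banked capacity
	maxTokens   float64 // how much unused capacity we are allowed to bank
}

// tick refills the bucket with one second's worth of provisioned capacity.
func (b *burstBucket) tick() {
	b.tokens = min(b.tokens+b.provisioned, b.maxTokens)
}

// admit consumes cost units if available; otherwise the request is throttled.
func (b *burstBucket) admit(cost float64) bool {
	if b.tokens < cost {
		return false
	}
	b.tokens -= cost
	return true
}

func min(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	b := burstBucket{provisioned: 100, maxTokens: 300}
	for i := 0; i < 5; i++ { // five quiet seconds: bank unused capacity
		b.tick()
	}
	fmt.Println(b.admit(250)) // true: banked tokens absorb a short spike
	fmt.Println(b.admit(250)) // false: the burst allowance is exhausted
}
```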

It is important to note that Burst and Adaptive Capacity are based on the assumption that the user's workload does not change much, and that flow control happens at the partition level (that is, almost at the Storage Node level); in other words, it is local scheduling.

In a large-scale storage system, flow control is actually a global problem. Using TiDB as an example: TiDB's SQL layer is stateless, and requests are forwarded to TiKV, so the TiDB SQL layer is a kind of 'Request Router' (to use the paper's term). But if multiple TiDB SQL nodes are deployed and flow control is only done in TiKV (the storage layer), then in extreme cases the TiDB SQL nodes might still keep hitting an overloaded TiKV node with requests. To solve this problem, you actually need to do flow control in the TiDB layer as well: return errors to the client directly from TiDB, and don't let the requests reach the overloaded TiKV nodes at all.

This part of DynamoDB's paper is a bit vague. My personal understanding of the strategy is that the Request Router periodically requests quota from the GAC (Global Admission Control). The GAC maintains a global allocation policy, and if some partition is overloaded, the corresponding Request Router can directly deny service to clients to protect the rest of the system. Partition-level flow control is still kept as the last line of defense at the node level.
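
Sketching my mental model of that interaction: routers periodically ask the GAC for a lease of table-level capacity, and a router that gets nothing back sheds load itself instead of hammering the overloaded storage node. The protocol and names below are guesses for illustration, not the paper's actual interface:

```go
package main

import (
	"fmt"
	"sync"
)

// gac is a toy Global Admission Control: it tracks how much of a table's
// total capacity is still unclaimed in this interval and hands out leases to routers.
type gac struct {
	mu        sync.Mutex
	remaining float64 // table-level capacity not yet leased out this interval
}

// lease grants up to `want` units; a router that gets 0 back should reject
// requests itself instead of forwarding them to the overloaded storage nodes.
func (g *gac) lease(want float64) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	granted := min(want, g.remaining)
	g.remaining -= granted
	return granted
}

func min(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	g := &gac{remaining: 1000}
	fmt.Println(g.lease(600)) // router A gets 600
	fmt.Println(g.lease(600)) // router B only gets the remaining 400
	fmt.Println(g.lease(600)) // router C gets 0 and sheds load at the router layer
}
```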

For a Shared-Nothing storage system, the key to scale-out is the sharding strategy, and DynamoDB chooses a dynamic one. Just like TiKV, it uses Range Split, but while TiKV's Region (similar to DynamoDB's Partition concept) by default uses size as the splitting threshold (TiDB 6.0 introduced a Load-based Split strategy), DynamoDB adopts Load-based Split directly. Partitions have a default throughput threshold and are split once they exceed it. And once you start monitoring how load is distributed over the key range, it is easy to find the optimal split point (which is not always the midpoint).
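
A toy version of picking a load-based split point: scan per-key throughput samples and cut where the cumulative load reaches half of the total, which is usually not the midpoint of the key range. The sampling and data structures here are my own simplification, not DynamoDB's heuristic:

```go
package main

import (
	"fmt"
	"sort"
)

// keyLoad is a sampled (key, throughput) pair over a partition's key range.
type keyLoad struct {
	key string
	ops float64
}

// splitPoint returns the first key whose prefix sum reaches half of the total
// load, so the two halves carry roughly equal traffic.
func splitPoint(samples []keyLoad) string {
	sort.Slice(samples, func(i, j int) bool { return samples[i].key < samples[j].key })
	var total float64
	for _, s := range samples {
		total += s.ops
	}
	var acc float64
	for _, s := range samples {
		acc += s.ops
		if acc >= total/2 {
			return s.key
		}
	}
	return samples[len(samples)-1].key
}

func main() {
	// Most traffic hits keys "a" and "b", so the split point is far from the midpoint "m".
	samples := []keyLoad{
		{"a", 900}, {"b", 400}, {"c", 700}, {"m", 300}, {"x", 150}, {"z", 50},
	}
	fmt.Println(splitPoint(samples)) // "b": splitting there balances load, not key space
}
```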

Another interesting thing about splitting is when you should avoid it. The paper mentions two cases:

  1. A single-row hotspot, which is easy to understand: splitting cannot spread the load on a single key.
  2. The access pattern is sequential in key order (similar to iterating over keys); the hot spot would simply move into one of the new partitions, so DynamoDB avoids splitting in this case.

To summarize, the core of DynamoDB's predictable performance lies in:

  1. A more accurate abstraction of the workload (WCU/RCU), instead of simple TPS/QPS/data size
  2. Pre-allocating quotas to partitions and strictly enforcing flow control