<aside> 💡

Partition = Shard (in a database context) = Node (in a server context)

</aside>

Modulo Hashing (and Its Problem)

serverIndex = hash(key) % n

(n is the size of the server pool)

It distributes traffic evenly only while the server pool stays stable, that is, no servers are added, removed, or failed.

If the number of servers changes, most keys hash to a different index, so the data across all n servers must be rearranged; if we don’t rearrange the data, clients will connect to the wrong servers to fetch it.

This is like when a HashMap fills its underlying array: it must rehash all entries into a new array, because the new capacity n changes the result of the hashing calculation.
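A minimal sketch of the problem (the `server_index` helper, the key format `user:<i>`, and the 4-to-5 server change are all made up for illustration):

```python
import hashlib

def server_index(key: str, n: int) -> int:
    # serverIndex = hash(key) % n, using SHA-1 as a stable hash
    # (Python's built-in hash() is salted per process, so it is not stable).
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % n

keys = [f"user:{i}" for i in range(1000)]

# Map each key with 4 servers, then again after adding a 5th server.
before = {k: server_index(k, 4) for k in keys}
after = {k: server_index(k, 5) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/{len(keys)} keys moved")  # typically ~80% of keys remap
```

Adding just one server changed n, which changed almost every key's index, so almost all data would have to move.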

Consistent Hashing (Hash Ring)

When a hash table is resized and consistent hashing is used, only k/n keys need to be remapped on average.

(k is the number of keys, n is the number of servers)

Why only k/n keys on average?

Each server owns about k/n keys. When a server goes down, we only need to remap its keys to the next server, instead of remapping the data on every other server.
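A back-of-the-envelope comparison of the two schemes when one server fails (the numbers k = 1,000,000 and n = 10 are hypothetical):

```python
k, n = 1_000_000, 10  # hypothetical: 1M keys, 10 servers

# Consistent hashing: each server owns about k/n keys, and only the
# failed server's keys move to its clockwise neighbor.
moved_consistent = k // n

# Modulo hashing: a key stays put only if hash % n == hash % (n - 1),
# which holds for about 1/n of keys, so roughly (1 - 1/n) of all keys move.
moved_modulo = round(k * (1 - 1 / n))

print(moved_consistent, moved_modulo)  # 100000 vs 900000
```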

SHA-1’s hash values fall between 0 and 2^160-1. (Hash Space)

Connect both ends, so that the 0 and 2^160-1 hash values land at the same position. The hash space then forms a hash ring.

There is no modulo operation. Keys may not land exactly on a server’s position on the ring; instead, each key moves clockwise until it reaches a server, and that server owns the key.
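A minimal ring sketch, assuming Python with SHA-1 positions and a sorted list searched via `bisect` (the server names `s0`–`s3` and the `HashRing` class are made up):

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Position on the ring: SHA-1 maps into [0, 2**160 - 1].
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sorted (position, server) pairs around the ring.
        self._ring = sorted((h(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise: the first server position >= the key's position,
        # wrapping around to the start of the ring if we pass 2**160 - 1.
        i = bisect.bisect_left(self._positions, h(key))
        return self._ring[i % len(self._ring)][1]

ring = HashRing(["s0", "s1", "s2", "s3"])
print(ring.lookup("user:42"))
```

Removing a server only moves that server's keys to its clockwise neighbor; every other key still meets the same server first.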

Virtual Nodes

When a server is added or removed, the server positions on the ring may not be evenly distributed, so some servers end up owning much larger arcs (and far more keys) than others.

With virtual nodes, each server is represented by multiple virtual nodes on the ring. When a key moves clockwise and meets a virtual node, the key belongs to that virtual node’s real server.

As the number of virtual nodes increases, the distribution of keys becomes more balanced, and it is easy to scale horizontally. However, more space is needed to store the virtual-node metadata.
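A sketch of virtual nodes, extending the same ring idea (the `"server#index"` naming of virtual nodes and the choice of 200 virtual nodes per server are assumptions, not a standard):

```python
import bisect
import hashlib
from collections import Counter

def h(value: str) -> int:
    # Position on the ring, taken from SHA-1's [0, 2**160 - 1] hash space.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class VirtualNodeRing:
    def __init__(self, servers, vnodes_per_server=200):
        # Each server appears vnodes_per_server times on the ring, under
        # hypothetical names like "s0#17"; each entry maps back to its
        # real server.
        self._ring = sorted(
            (h(f"{s}#{i}"), s)
            for s in servers
            for i in range(vnodes_per_server)
        )
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Clockwise to the nearest virtual node; return its real server.
        i = bisect.bisect_left(self._positions, h(key))
        return self._ring[i % len(self._ring)][1]

ring = VirtualNodeRing(["s0", "s1", "s2", "s3"])
counts = Counter(ring.lookup(f"user:{i}") for i in range(10_000))
print(counts)  # with 200 virtual nodes each, loads cluster near 2500
```

With only one position per server, the four loads can differ wildly; spreading 200 virtual nodes per server averages out the arc lengths, at the cost of an 800-entry ring instead of a 4-entry one.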
