This notion page describes how peers in a Kademlia DHT can infer the global network size without any additional networking overhead compared to the baseline implementation.

In the context of the Optimistic Provide project the global network size will feed into the closeness estimation. However, that knowledge can be used for all kinds of distributed algorithms or sanity checks.

<aside> ⚙️ Implementation: https://github.com/libp2p/go-libp2p-kad-dht/pull/783/files#diff-0a1e530b1aaeeaf12694f3d450b708cb852ef1c29f93d831c740f4efebcea063

</aside>

Kademlia DHT nodes refresh their routing table periodically every 10min (libp2p implementation). This process searches for each bucket in the routing table the `20`

closest peers to a randomly generated CID. So, e.g., for bucket 2 the node searches the `20`

closest peers to a random CID that shares `2`

bits as a prefix. Eventually, the node searches the `20`

closest peers to its own PeerID (common prefix length of `256`

).

We are using the distances of the `20`

closest peers to the randomly generated CIDs to derive a network size estimate. How we do this is described below.

<aside> 💡 The below is inspired by this blog post.

</aside>

The network consists of $n$ nodes $N_1, N_2, \dots, N_n$ where each node is identified by their PeerID. Relevant for the Kademlia DHT are the KadIDs which are SHA256 hashes of the PeerIDs (PeerIDs are just multi-hashes of the nodes public key). The KadIDs are bitstrings of length $L$ where $L=256$ in the case of the libp2p Kademlia DHT. Further, due to the hashing of PeerIDs KadIDs are uniformly distributed in the address space of size $2^L$.

Distances in the network are computed using the XOR metric. If we chose a random address $A$ in the address space the distance to another address $X$ would be:

$$ f(X) = X \oplus A $$

When we take the $n$ nodes $N_i$ with $i=1,\dots, n$ and calculate the distances $D_i$ to address $A$ we end up with the following set $D$ of distances

$$ D=\{D_1, D_2, \dots,D_n\} $$

where $D_1 = N_1 \oplus A, \, D_2 = N_2 \oplus A, \dots , D_n = N_n \oplus A$.

If we assume (without loss of generality) $D_1<D_2<\dots<D_n$ then $D_i=D_{(i)}$ where $D_{(i)}$ is called the $i$-th order statistic.

Now, we assume continuous random variables instead of discrete ones (which seems reasonably due to the large address space of size $2^L$). Then, we can use the probability distributions for order statistics that were sampled from a uniform distribution (also here).

These probability distribution functions are given by Beta distributions with varying values for $\alpha$ and $\beta$. In our case $\alpha=i$ where $i$ corresponds to the $i$-th order statistic in the range $i=1,\dots, n$ and $\beta=n-i+1$ (took this from Wikipedia and the link above).

Depending on the exact location of address $A$ in the key space, the distance to the closest of the $n$ nodes in the network varies. The distribution of those closest distances is what the probability distribution for order statistic $D_{(1)}$ shows us. Similarly, if we chose an address $A$ and look at the distribution of the second closest nodes we get the distribution of order statistic $D_{(2)}$.