to track the theoretical record lifetime for a range of CIDs across the hash space

General

Maintainer: @Mikel Cortes

Study description: Notion Doc

Theoretical Study: DHT Theoretical Record Lifetimes

RFM: 17

DGM Grant: IPFS Provider Record Liveness

GitHub Repo: cortze/ipfs-cid-hoarder

Report: rfm17

Motivation

In any content-sharing platform, distributed or not, the content needs to be stored somewhere. In the IPFS network, although the content may live in more than one location, it normally originates from the IPFS client/server that holds the content and publishes the Provider Records (PR) to the rest of the network. Each PR links the CID of the shared content to the multiaddress from which that content can be retrieved.

As described in the Kademlia DHT paper, the PRs are shared with K=20 other peers of the network, namely the K peers closest to the CID in XOR distance. This step corresponds to the PROVIDE method of the IPFS-Kad-DHT implementation: the client/server finds the peers closest to the CID and then sends each of them an ADD_PROVIDER message, notifying them that they belong to the set of closest peers for that specific content. After this process, any other peer that walks the IPFS DHT to retrieve that CID will look up the closest peers to the CID and, in the ideal scenario, obtain the PR from one of these K=20 peers.
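The selection of the K closest peers can be illustrated with a minimal sketch. It assumes that both peer identifiers and content keys are mapped into a common 256-bit keyspace with SHA-256 before the XOR distance is computed; the names `dht_key`, `xor_distance` and `closest_peers`, as well as the peer identifiers, are hypothetical and are not taken from the IPFS-Kad-DHT code base.

```python
import hashlib
from typing import List

K = 20  # replication factor discussed above


def dht_key(data: bytes) -> int:
    """Map arbitrary bytes into a 256-bit keyspace (assumption: SHA-256)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR of two keys, read as an integer."""
    return a ^ b


def closest_peers(content_key: bytes, peers: List[bytes], k: int = K) -> List[bytes]:
    """Return the k peers whose keys are closest to the content key."""
    target = dht_key(content_key)
    ranked = sorted(peers, key=lambda p: xor_distance(dht_key(p), target))
    return ranked[:k]


# Hypothetical network of 1000 peers and one piece of content.
peers = [f"peer-{i}".encode() for i in range(1000)]
holders = closest_peers(b"example-content", peers)
print(f"{len(holders)} peers selected as PR Holders")
```

In a real network the publisher does not know every peer, so the closest set is found through an iterative DHT lookup rather than a global sort; the sketch only shows the ordering criterion.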

The theory looks solid: the K=20 value was initially chosen to increase the network's resilience to node churn. At the moment, the overall network seems to be working fine, although there are some concerns about the impact of Hydra nodes and the churn rate.

The fact that K=20 peers keep the records means that the content should be retrievable as long as at least one of them is actively keeping the PR. However, if only one of the 20 peers still keeps the records 4 hours after the PRs were published, one could conclude that the network is exposed to a very high node churn rate and that, therefore, the K value is no longer appropriate for the node churn and network size at that specific moment.
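The reasoning above can be expressed as a small check over ping results. This is only a sketch: the data layout (one dictionary per ping round, mapping a holder to whether it still returned the PR), the helper names and the sample values are all hypothetical, not measurements.

```python
from typing import Dict


def holders_alive(round_result: Dict[str, bool]) -> int:
    """Count the PR Holders that still returned the record in one ping round."""
    return sum(1 for keeps_record in round_result.values() if keeps_record)


def retrievable(round_result: Dict[str, bool]) -> bool:
    """The content stays retrievable while at least one holder keeps the PR."""
    return holders_alive(round_result) >= 1


# Hypothetical example: 20 holders at publication time,
# only one of them still keeping the record 4 hours later.
holders = [f"holder-{i}" for i in range(20)]
rounds = {
    0: {h: True for h in holders},
    4: {h: (h == "holder-0") for h in holders},
}

for hours, result in rounds.items():
    alive = holders_alive(result)
    print(f"t={hours}h: {alive}/{len(result)} holders alive, "
          f"retrievable={retrievable(result)}")
```

In the second round the content is still retrievable, but with a single surviving holder there is no remaining margin against further churn.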

There are some concerns about the impact of the Hydra-boosters on the network, as they represent a more centralized infrastructure than the one targeted by the IPFS network. Hydra nodes are placed in the network to accelerate content discovery and, therefore, the performance of IPFS. However, are they the ones keeping the IPFS network alive?

The study aims to address these concerns by building a tool that tracks the peers chosen as PR Holders, providing more insight into Provider Record liveness.

Methodology

IPFS-CID-Hoarder is the tool that will collect the data for the study (currently under development).

In this first version of the CID-Hoarder, the tool will generate a set of random CIDs and track them over time. The randomness of the CIDs helps to cover the entire hash space homogeneously, which in turn helps to identify which ranges of the hash space suffer most from node churn or from a lack of peers.
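The idea of covering the hash space with random CIDs can be sketched as follows. The example is not the CID-Hoarder implementation: it assumes CIDv1 with the raw codec and a SHA-256 multihash, assumes that the DHT key is the SHA-256 of the multihash, and uses hypothetical helper names (`random_cid`, `bucket`) and a hypothetical sample size.

```python
import base64
import hashlib
import os
from collections import Counter
from typing import Tuple


def random_cid(content_size: int = 1024) -> Tuple[str, bytes]:
    """Build a CIDv1 (raw codec, sha2-256) for random content.

    Returns the base32 CID string and the assumed 32-byte DHT key.
    """
    content = os.urandom(content_size)
    digest = hashlib.sha256(content).digest()
    multihash = bytes([0x12, 0x20]) + digest    # sha2-256 code, 32-byte length
    cid_bytes = bytes([0x01, 0x55]) + multihash  # CID version 1, raw codec
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    kad_key = hashlib.sha256(multihash).digest()  # assumption: key = sha256(multihash)
    return "b" + encoded, kad_key                 # "b" is the base32 multibase prefix


def bucket(key: bytes, bits: int = 4) -> int:
    """Index of the keyspace region, taken from the leading bits of the key."""
    return key[0] >> (8 - bits)


# Hypothetical sample: 1600 random CIDs spread over 16 keyspace regions.
samples = [random_cid() for _ in range(1600)]
histogram = Counter(bucket(key) for _, key in samples)

for region in sorted(histogram):
    print(f"region {region:2d}: {histogram[region]} CIDs")
```

With uniformly random content, each region is expected to receive roughly the same number of CIDs, which is what lets per-region differences in record liveness be attributed to the network rather than to the sampling.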

As explained before, the tool will have a set of inputs to configure the study: