Motivation

The topology of structured overlay networks, such as DHTs, follows rules that enable routing among the nodes that constitute the network. In DHTs, each node is assigned a random bit-string identifier, and identifiers are compared according to some distance metric. As a result, nodes that share similar properties, such as belonging to the same geographical region, are not necessarily placed close to each other at the DHT level. Multi-level DHT designs aim to group nodes that share pre-defined properties in the DHT, improving the performance of routing among those nodes.

The IPFS network leverages a DHT to share content among nodes in the network. In this network, we refer to nodes that provide content as providers and to clients that request content as requesters.

The aim of this measurement project is to determine whether there is locality of interest in the IPFS network (i.e., whether requesters request content provided by providers that are in the same geographic region). The goal is to understand if and when the IPFS network would benefit from a Multi-level DHT design, and to guide the design of future solutions.

Summary of Findings

To achieve the goal of the project, we collected the logs of one of the most popular IPFS gateways, ipfs.io. These logs contain HTTP requests made by external clients for content stored in the IPFS network. From these logs, we extracted the requested CIDs (i.e., IPFS content identifiers) and the geo-location of requesters. We then issued our own requests to the IPFS network for the providers of the CIDs found in the gateway logs and stored the geo-location of those providers. With this data, it was possible to determine the geo-locality of requested content in the IPFS network.
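As an illustration of the CID extraction step, the following is a minimal sketch that pulls the CID out of a path-style gateway request (/ipfs/<cid>/...). The function name extract_cid is hypothetical, and the sketch assumes the request target has already been isolated from the log line; the actual layout of the ipfs.io logs is not described here, and subdomain-style gateway URLs are not handled.

```python
from typing import Optional
from urllib.parse import urlsplit


def extract_cid(request_target: str) -> Optional[str]:
    """Return the CID of a path-style gateway request, or None.

    Hypothetical helper: assumes targets of the form /ipfs/<cid>[/sub/path][?query].
    """
    # Drop any query string or fragment and keep only the path.
    path = urlsplit(request_target).path
    # Split the path into its non-empty components.
    parts = [p for p in path.split("/") if p]
    # A content request has "ipfs" as the first component and the CID as the second.
    if len(parts) >= 2 and parts[0] == "ipfs":
        return parts[1]
    return None
```

Requests that do not target /ipfs/ paths (for example /ipns/ names) return None in this sketch and would need separate handling.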

In more detail, the ipfs.io logs contained a bit more than 58 million valid HTTP requests for a bit more than 4 million different CIDs (there were multiple requests to the same CIDs). We tried to fetch the providers for each of the 4 million CIDs; however, we only managed to find providers for around 45% of all CIDs. In total, we found a bit more than 55 thousand different providers. However, some of the provider records we found lacked addressing information (i.e., the provider record had peer IDs with no associated multi-address), which made it impossible to extract geo-location information. As such, in this study we only considered the providers that had addressing information, which amounts to about 48% of the providers found.
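The filtering of providers without addressing information can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the record layout (a mapping from peer ID to a list of multi-address strings) and the function names are assumptions. It relies only on the textual multi-address form, in which an IP address follows an ip4 or ip6 component (e.g., /ip4/203.0.113.7/tcp/4001).

```python
from typing import Dict, List


def ip_addresses(multiaddrs: List[str]) -> List[str]:
    """Collect the IP addresses embedded in textual multi-addresses."""
    ips = []
    for addr in multiaddrs:
        parts = addr.strip("/").split("/")
        # Scan consecutive components; an address follows "ip4" or "ip6".
        for proto, value in zip(parts, parts[1:]):
            if proto in ("ip4", "ip6"):
                ips.append(value)
    return ips


def locatable_providers(records: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the providers that expose at least one IP address.

    `records` maps a peer ID to its multi-addresses (assumed layout).
    """
    located = {}
    for peer_id, addrs in records.items():
        ips = ip_addresses(addrs)
        if ips:  # providers without any address cannot be geo-located
            located[peer_id] = ips
    return located
```

In this sketch, providers that only advertise DNS-based multi-addresses are also dropped, since no IP address can be read directly from the record.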

In our analysis, we studied multiple aspects of how content is distributed among providers and how requests are distributed across different content. In particular, we found that requests to content follow a Zipf distribution, meaning that in the IPFS network a select few CIDs are highly popular (i.e., are requested many times), while a large number of CIDs are not popular at all, being requested only a single time. Furthermore, and interestingly, we found that the popular CIDs are not the CIDs with the highest numbers of providers. In fact, the most popular CIDs have fewer than 10 providers. However, we found very interesting outliers. In particular, there was one very popular CID that had many more providers than the remaining popular CIDs. We found that this CID had a different nature from the rest.
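One common way to check for Zipf-like behaviour is to rank CIDs by request count and look at the slope of the rank-frequency curve on log-log axes, where a roughly straight line with negative slope is suggestive of a Zipf distribution. The sketch below illustrates this heuristic; it is not necessarily the procedure used in the study, the function names are hypothetical, and a straight line on a log-log plot is an indicator rather than a rigorous statistical test.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(cids: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (rank, request count) pairs, most requested CID first."""
    counts = sorted(Counter(cids).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(points: List[Tuple[int, int]]) -> float:
    """Least-squares slope of log(count) against log(rank)."""
    if len(points) < 2:
        raise ValueError("need at least two distinct CIDs to fit a slope")
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(count) for _, count in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For an ideal Zipf distribution with exponent s, the fitted slope is close to -s.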

Finally, we analysed the relation between the geo-location of the origin of requests and the geo-location of providers, and found that there is very little relation. For example, only 5.24% of all requests originating from Asia actually have providers in Asia, while 30.32% and 48.41% of all requests originating from Asia have providers in Europe and North America, respectively. This is because most content and providers are located in Europe and North America.
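The relation between request origin and provider location can be summarised as a matrix of percentages. The sketch below is one possible way to compute it and rests on an assumption about the normalisation: each request counts once towards every region in which it has at least one provider, and the result is divided by the number of requests from that origin, so a row need not sum to 100%. The study may have normalised differently; the function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def locality_matrix(
    requests: Iterable[Tuple[str, Set[str]]]
) -> Dict[str, Dict[str, float]]:
    """Percentage of requests per origin region with a provider in each region.

    Each item is (requester region, set of regions hosting providers of the CID).
    """
    totals = defaultdict(int)                      # requests per origin region
    hits = defaultdict(lambda: defaultdict(int))   # origin -> provider region -> count
    for origin, provider_regions in requests:
        totals[origin] += 1
        for region in provider_regions:
            hits[origin][region] += 1
    return {
        origin: {
            region: 100.0 * count / totals[origin]
            for region, count in hits[origin].items()
        }
        for origin in totals
    }
```

A strong locality of interest would show up as large values on the diagonal of this matrix (origin region equal to provider region).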

We have written a paper detailing our study. You can find a pre-print version of the paper here: https://arxiv.org/abs/2212.07375.

In the following we provide more details on how we reached these conclusions and provide further insights.

Measurement Methodology & Architecture

Figure 1 represents our measurement architecture. The main component is the controller, which coordinates the whole process: processing the gateway logs, requesting the providers, and finally populating a database with the relevant information. The controller feeds the gateway logs through a message broker, in our case a simple RabbitMQ instance. The controller then requests the Parser component to parse each log entry into a structured data format such as JSON. From the parsed data, the controller requests the Find Providers component to query the IPFS DHT for the providers of the requested CID. Once the providers are returned, the controller again requests the Parser to obtain geo-location information from the IP addresses of the providers found. Finally, the controller writes the processed data to a database component, in our case a PostgreSQL database, which supports a dashboard (Grafana) to visualize the data.
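The control flow described above can be sketched in a few lines. This is a simplified illustration and not the project's implementation: an in-memory queue stands in for the message broker, a plain list stands in for the database, and the three components are passed in as ordinary callables. All names and record fields are hypothetical.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, Dict, List


@dataclass
class Controller:
    """Coordinates parsing, provider lookup, geo-location and storage."""

    parse: Callable[[str], Dict[str, str]]      # log line -> {"cid": ..., "ip": ...}
    find_providers: Callable[[str], List[str]]  # CID -> provider IP addresses
    geolocate: Callable[[str], str]             # IP address -> region
    rows: List[Dict[str, object]] = field(default_factory=list)  # database stand-in

    def run(self, broker: Queue) -> None:
        # Drain the broker, handling one log entry at a time.
        while not broker.empty():
            entry = self.parse(broker.get())
            providers = self.find_providers(entry["cid"])
            self.rows.append({
                "cid": entry["cid"],
                "requester_region": self.geolocate(entry["ip"]),
                # Distinct provider regions, sorted for stable output.
                "provider_regions": sorted({self.geolocate(ip) for ip in providers}),
            })
```

Passing the components in as callables keeps the sketch independent of any particular broker, DHT client, or geo-location database.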

Figure 1. Measurement architecture.

The architecture presented in Figure 1 is prepared for continuous monitoring. We also used tools for offline processing. In fact, we separated the parsing of the HTTP requests from the fetching of provider records (i.e., we only fetched the providers once all HTTP requests had been parsed). This made fetching the provider records more efficient, since we fetched the providers for each CID only once, regardless of how many times it was requested. Furthermore, we used Python scripts to create the plots presented in this report.
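The benefit of separating the two phases can be illustrated with a short sketch: once all requests are parsed, provider lookups are issued per distinct CID rather than per request. The function below is a hypothetical illustration in which find_providers stands for whatever DHT lookup is used.

```python
from typing import Callable, Dict, Iterable, List


def fetch_providers_once(
    requested_cids: Iterable[str],
    find_providers: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Look up providers exactly once per distinct CID."""
    providers_by_cid: Dict[str, List[str]] = {}
    for cid in requested_cids:
        # Skip CIDs already looked up, however often they were requested.
        if cid not in providers_by_cid:
            providers_by_cid[cid] = find_providers(cid)
    return providers_by_cid
```

With roughly 58 million requests over roughly 4 million distinct CIDs, this reduces the number of lookups by more than an order of magnitude.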

In the following, we provide more details on the steps used to obtain the locality of requests in the IPFS network.

Step 1. Parse data