https://www.youtube.com/watch?v=bXaL64fp55c
Large content providers don’t publish Provider Records to the DHT because the process is too resource-intensive at their scale. As a consequence, this content isn’t discoverable on the DHT at all and must be discovered through Bitswap broadcast, which is inefficient for many reasons. Enabling large content providers to publish their content to the DHT is a prerequisite for turning off the Bitswap broadcast feature; otherwise some content couldn’t be found at all. Turning off Bitswap broadcast is a major milestone toward making IPFS more resource efficient. It will help all IPFS peers, but especially large content providers, reduce their bandwidth bill, as they will no longer be spammed by Bitswap. Hence, the whole IPFS ecosystem would benefit from large content providers publishing to the DHT.
One easy improvement to make the DHT Provide operation MUCH cheaper is to add a ProvideMultiple RPC.
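To make the idea concrete, here is a minimal sketch of what a batched provide API could look like, assuming a hypothetical ProvideMultiple method (the interface, Cid type, and countingRouter are illustrative stand-ins, not the real go-libp2p-kad-dht API). The toy counter just makes the saving visible: N keys cost one RPC instead of N.

```go
package main

import "fmt"

// Cid stands in for a real content identifier (github.com/ipfs/go-cid
// in the actual stack). Hypothetical, for illustration only.
type Cid string

// Router sketches today's one-key-per-call Provide interface.
type Router interface {
	Provide(c Cid) error
}

// BatchRouter adds the hypothetical ProvideMultiple RPC: one call
// carries many Provider Records to the same DHT server.
type BatchRouter interface {
	Router
	ProvideMultiple(cs []Cid) error
}

// countingRouter is a toy implementation that only counts RPCs.
type countingRouter struct{ rpcs int }

func (r *countingRouter) Provide(c Cid) error          { r.rpcs++; return nil }
func (r *countingRouter) ProvideMultiple(cs []Cid) error { r.rpcs++; return nil }

func main() {
	keys := []Cid{"bafy...1", "bafy...2", "bafy...3"}

	one := &countingRouter{}
	for _, k := range keys {
		one.Provide(k)
	}

	batched := &countingRouter{}
	batched.ProvideMultiple(keys)

	fmt.Println(one.rpcs, batched.rpcs)
}
```

The real saving is larger than the RPC count suggests, since each per-key Provide may also pay for a DHT walk and a fresh connection.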
go-ipfs-provider encapsulates some of the provide logic. kubo then periodically calls the Provide method exposed by go-libp2p-routing-helpers.
Move the reprovide logic for the DHT into go-libp2p-kad-dht. kad-dht should expose an interface to kubo for adding keys to reprovide and removing keys that no longer need to be reprovided. kubo doesn’t need to manage republishing itself; it can pass parameters describing the reprovide strategy down to kad-dht.
IPNI doesn’t need reproviding by design. Hence the reprovide strategy should be Content Router specific, and managed by the Content Router. go-libp2p-routing-helpers must expose a Reprovide API (e.g. reprovide tracker add and remove). The Content Routers should manage reproviding by themselves, possibly accepting a reprovide strategy passed down from kubo.
- Create the DHT as a content router.
- When a new CID is added to kubo and should be provided, call contentrouter.StartProviding(CID). The DHT manages all the rest.
- When kubo wants to stop providing some content, it calls contentrouter.StopProviding(CID).
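The kubo-facing side of this flow could look like the sketch below. The ContentRouter interface and dhtRouter stub are assumptions for illustration; the point is only that kubo signals intent and the router handles the initial provide and all subsequent reprovides.

```go
package main

import "fmt"

// Cid stands in for a real content identifier.
type Cid string

// ContentRouter sketches the kubo-facing API described above.
type ContentRouter interface {
	StartProviding(c Cid) error
	StopProviding(c Cid) error
}

// dhtRouter is a stub DHT content router tracking what it must provide.
type dhtRouter struct{ providing map[Cid]bool }

func newDHTRouter() *dhtRouter { return &dhtRouter{providing: map[Cid]bool{}} }

func (d *dhtRouter) StartProviding(c Cid) error {
	d.providing[c] = true
	// A real implementation would also publish the Provider Record now
	// and schedule the key for the periodic reprovide sweep.
	return nil
}

func (d *dhtRouter) StopProviding(c Cid) error {
	delete(d.providing, c)
	return nil
}

func main() {
	var r ContentRouter = newDHTRouter()
	r.StartProviding("bafy...X") // new CID added to kubo
	r.StopProviding("bafy...X")  // content no longer pinned
	fmt.Println("ok")
}
```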
Reprovide Sweep

Design

All keys located in the same keyspace region are reprovided at once. As some large Content Providers publish more CIDs than there are DHT Servers, by the pigeonhole principle there must be DHT Servers that are allocated more than one Provider Record from such a Content Provider. The primary rationale is to (Re)Provide all Provider Records allocated to the same DHT Server at once.
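The magnitude of the overlap can be made concrete with back-of-the-envelope numbers (the figures below are illustrative assumptions, not measurements): with replication factor 20, a provider of 100M CIDs on a DHT of ~25,000 servers allocates on average 100M × 20 / 25,000 = 80,000 Provider Records per DHT Server.

```go
package main

import "fmt"

// recordsPerServer computes the average number of Provider Records a
// single DHT server holds for one provider, assuming records are
// spread uniformly over the keyspace.
func recordsPerServer(cids, servers, replication int) float64 {
	return float64(cids) * float64(replication) / float64(servers)
}

func main() {
	// Illustrative numbers: 100M CIDs, ~25,000 DHT servers, k = 20.
	fmt.Printf("%.0f\n", recordsPerServer(100_000_000, 25_000, 20))
}
```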
As sending multiple Provider Records requires a new RPC, and hence a breaking change, it isn’t trivial to send all Provider Records at exactly the same time. However, the most expensive parts of a (Re)Provide operation are the DHT walk to discover the right DHT Servers to store the Provider Records on, and opening new connections to these peers. Once these peers are known and a connection is already open, the Content Provider can simply reuse the same connection to send multiple individual Provide requests.
The DHT implementation go-libp2p-kad-dht must keep track of the CIDs to be republished every Interval (let’s assume all Provider Records are republished at the same frequency). The Kademlia identifiers of the CIDs to republish must be arranged in a binary trie to allow faster access. As each Provider Record is replicated on 20 different DHT Servers, 20 DHT Servers in a close keyspace locality are expected to store the same Provider Records (in reality not exactly; see Advanced Design for a precise explanation). The Content Provider will continuously look up keys in the keyspace, from left to right, hence sweeping the keyspace. For each requested key, it will find the 20 closest peers, and look up in its CIDs Republish Binary Trie all Provider Records that would belong
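The sweep order itself can be sketched without the full binary trie. The toy below (my simplification, not the go-libp2p-kad-dht implementation) shrinks Kademlia IDs to 16 bits, keeps them sorted, and walks the keyspace left to right, grouping keys that share a region prefix — i.e. keys expected to land on the same ~20 DHT Servers — so each group can be reprovided together over the same connections.

```go
package main

import (
	"fmt"
	"sort"
)

// sweep visits IDs in keyspace order and groups those sharing the top
// prefixBits bits into one region, mimicking the left-to-right sweep.
func sweep(ids []uint16, prefixBits uint) [][]uint16 {
	sorted := append([]uint16(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var regions [][]uint16
	var cur []uint16
	var curPrefix uint16
	for i, id := range sorted {
		p := id >> (16 - prefixBits) // region = shared high-order bits
		if i == 0 || p != curPrefix {
			if cur != nil {
				regions = append(regions, cur)
			}
			cur, curPrefix = nil, p
		}
		cur = append(cur, id)
	}
	if cur != nil {
		regions = append(regions, cur)
	}
	return regions
}

func main() {
	ids := []uint16{0x8F01, 0x0A10, 0x0A22, 0x8FFF, 0x0B00}
	for _, region := range sweep(ids, 8) {
		fmt.Printf("%04X\n", region)
	}
}
```

In the real design the region boundary is not a fixed prefix length but is derived from the 20 closest peers found at each looked-up key; the fixed 8-bit prefix here is only to keep the sketch self-contained.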