Step 1. Extract requested CIDs and related data.

  1. Parse data from IPFS gateways into a structured data format, such as CSV.

    IPFS gateway logs come in the form of nginx server logs, in the following format:

    199.83.232.50 - - [2022-03-21T00:00:58+00:00] "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1" 200 50470 120 12.823 12.820 12.820 MISS "-" "-" *.i.ipfs.io ipfs.io https
    <ip> <x> <y> <time> <request> <status> <body_bytes> <req_length> <request_time> <upstream_response_time> <upstream_header_time> <upstream_cache_status> <http_refer> <http_user_agent> <server_name> <http_host> <scheme>
    

    Splitting the above example into its respective fields we get:

    ip: 199.83.232.50
    x: -
    y: -
    time: [2022-03-21T00:00:58+00:00]
    request: "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1"
    status: 200
    body_bytes: 50470
    req_length: 120
    request_time: 12.823
    upstream_response_time: 12.820
    upstream_header_time: 12.820
    upstream_cache_status: MISS
    http_refer: "-"
    http_user_agent: "-"
    server_name: *.i.ipfs.io
    http_host: ipfs.io
    scheme: https
    

    Each field has the following description:

    ip: the IP address of the requester
    x: unused field (always "-" in these logs)
    y: unused field (always "-" in these logs)
    time: the time the request was issued
    request: a string representing the original request
    status: HTTP status of the response
    body_bytes: bytes returned to the requester
    req_length: bytes of the request
    request_time: time it took to resolve the request, in seconds
    upstream_response_time: time to receive the response from the upstream server, in seconds
    upstream_header_time: time to receive the response header from the upstream server, in seconds
    upstream_cache_status: status of accessing the response cache
    http_refer: the HTTP Referer header of the request (the page that linked to the requested resource)
    http_user_agent: user agent used to perform the HTTP request (e.g., browser version)
    server_name: name of the server that accepted the request
    http_host: HTTP host
    scheme: request scheme (http or https)
    

    Parse each log entry into these fields to get structured data:

    {
      "ip": the IP address of the requester,
      "time": the time the request was issued,
      "op": HTTP operation (GET, POST, ...),
      "target": target URL for the HTTP operation,
      "http": HTTP version used,
      "status": HTTP status of the response,
      "body_bytes": bytes returned to the requester,
      "request_length": bytes of the request,
      "request_time": time it took to resolve the request, in seconds,
      "upstream_response_time": time to receive the response from the upstream server, in seconds,
      "upstream_header_time": time to receive the response header from the upstream server, in seconds,
      "cache": status of accessing the response cache,
      "http_refer": the HTTP Referer header,
      "http_user_agent": user agent used to perform the HTTP request (e.g., browser version),
      "server_name": name of the server that accepted the request,
      "http_host": HTTP host,
      "scheme": request scheme (http or https)
    }
    

    Parsing the above request will result in:

    {
      "ip": 199.83.232.50,
      "time": [2022-03-21T00:00:58+00:00],
      "op": GET,
      "target": /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png,
      "http": HTTP/1.1,
      "status": 200,
      "body_bytes": 50470,
      "request_length": 120,
      "request_time": 12.823,
      "upstream_response_time": 12.820,
      "upstream_header_time": 12.820,
      "cache": MISS,
      "http_refer": -,
      "http_user_agent": -,
      "server_name": *.i.ipfs.io,
      "http_host": ipfs.io,
      "scheme": https
    }
    

    Things to note:

    1. Sometimes the fields upstream_response_time and upstream_header_time may have more than one value, separated by commas:

      199.83.232.55 - - [2022-03-21T00:01:02+00:00] "GET /ipfs/QmPZLypREH8okGjFjSagUcbbrJhBJHhGsDVKfkZM62FBxs/5806.json HTTP/1.1" 200 727 113 61.430 60.009, 1.424 60.009, 1.424 MISS "-" "-" *.i.ipfs.io ipfs.io https
      upstream_response_time: 60.009, 1.424
      upstream_header_time: 60.009, 1.424
      

      When this happens, it means that more than one upstream request was made.

    2. http_refer and http_user_agent are string fields similar to the request field.

    Parsing strategy:

    1. Parse string fields. String fields are enclosed in double quotes ("), so they can easily be extracted with a regex:

      "(.*?)"
      
    2. Remove string fields from log entry.

    3. Split log entry by space and keep a token counter.

    4. For upstream_response_time and upstream_header_time, advance the token counter while the end of the token is a comma.

    A simple python program to parse a log entry would look like this:

    import re #import for regex

    matches = re.findall(r'"(.*?)"', logEntry) #finds all quoted strings
    request = matches[0]
    http_refer = matches[1]
    http_user_agent = matches[2]
    tokens = request.split(' ')
    op = tokens[0]
    target = tokens[1]
    http = tokens[2]

    entry = re.sub(r'"(.*?)"', '', logEntry) #removes all quoted strings from the line
    tokens = entry.split(' ')
    i = 0
    ip = tokens[i]
    i += 3
    time = tokens[i]
    i += 2 #skip the empty token left where the request string was removed
    status = tokens[i]
    i += 1
    body_bytes = tokens[i]
    i += 1
    request_length = tokens[i]
    i += 1
    request_time = tokens[i]
    i += 1

    upstream_response_time, upstream_header_time = [], []
    while tokens[i][-1] == ',':
        upstream_response_time.append(tokens[i][:-1]) #drop the trailing comma
        i += 1

    upstream_response_time.append(tokens[i])
    i += 1
    while tokens[i][-1] == ',':
        upstream_header_time.append(tokens[i][:-1]) #drop the trailing comma
        i += 1

    upstream_header_time.append(tokens[i])
    i += 1
    cache = tokens[i]
    i += 3 #skip the two empty tokens left where http_refer and http_user_agent were removed
    server_name = tokens[i]
    i += 1
    http_host = tokens[i]
    i += 1
    scheme = tokens[i].rstrip('\n') #strip the trailing newline, if any
    

    Note there might be out-of-format lines! Wrap this code in a try/except block to deal with them.
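
    A minimal sketch of such a wrapper (parseLogEntry is a hypothetical function holding the parsing code above):

    def parseEntries(logFile):
        entries = []
        for line in logFile:
            try:
                entries.append(parseLogEntry(line)) #parse one line as above
            except Exception:
                continue #skip out-of-format lines
        return entries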

  2. Filter data.

    To characterize the request patterns of IPFS we only keep requests that match the following restrictions: GET operations, successful (2xx) responses, and requests coming from remote addresses (i.e., not from localhost).

    We just need to apply these restrictions to the structured data we parsed before. Filtering the data this way can easily be done with pandas, for example:

    import pandas as pd #import for pandas
    
    #load a csv file to a pandas dataframe
    df = pd.read_csv(parsedLogFile, keep_default_na=False)
    
    #filter for GET operations
    df = df[df['op'] == 'GET']
    
    #filter for successful operations
    df = df.astype({'status': int})
    df = df[(df['status'] >= 200) & (df['status'] < 300)]
    
    #filter for remote requests
    df = df[(df['ip'] != '127.0.0.1') & (df['ip'] != '::1')]
    
  3. Extract CIDs from data.

    We need to extract the CIDs from the data. CIDs can usually be found in the target field. However, there are instances in the logs where the CID is in the http_host field.

    To this end, we need to join these two fields and search for the CID in the resulting string.

    CIDs can be found by searching the string for tokens with a known prefix: CIDv0 identifiers start with Qm, and base32 CIDv1 identifiers start with baf. A helper function to do this would be:

    import re #import for regex
    import numpy as np

    def extractCid(http_host, target):
        link = http_host + target
        cid = []
        cid.extend(re.findall(r'Qm\w+', link)) #CIDv0
        cid.extend(re.findall(r'baf\w+', link)) #CIDv1 (base32)
        if len(cid) == 0:
            return np.nan
        ## which cid should we return if there are several ? all ?
        ## return the first for now
        return cid[0]
    
  4. Get geo-location of requester IP

    We need to find the geo-location of requesters to map the location of requests. To this end, use a geo-ip database such as MaxMind.

    To get this data, we use a python library: python-geoip-geolite2

    import numpy as np
    from geoip import geolite2 #import geoip database

    match = geolite2.lookup(ip)

    if match is not None:
        continent = match.continent #returns the continent
        country = match.country #returns the country
        regions = match.subdivisions #this will return a list of the regions
    else:
        continent = country = regions = np.nan #handle the case when there is no match
    

    The lookup might not always return a full value. It can happen that only the continent is known for an IP address, and most matches will not have regions. Regions are useful for large countries, such as the USA, where each region encodes a state.
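
    A small helper along these lines (a sketch; lookupLocation is a hypothetical name) can return a (continent, country, region) tuple with NaN for any missing part, which fits the NaN-based filtering used below:

    import numpy as np
    from geoip import geolite2 #import geoip database

    def lookupLocation(ip):
        #returns (continent, country, region), with NaN for missing values
        match = geolite2.lookup(ip)
        if match is None:
            return np.nan, np.nan, np.nan
        continent = match.continent if match.continent else np.nan
        country = match.country if match.country else np.nan
        regions = list(match.subdivisions) #most matches have no regions
        region = regions if regions else np.nan
        return continent, country, region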


  5. Putting all together.

    The final result should be a dataset that associates each requested CID to an IP address and to the location of that IP address:

    {
      "cid": the requested CID,
      "ip": the IP address from where the request was made,
      "continent": the continent associated with the IP address,
      "country": the country associated with the IP address,
      "region": the region associated with the IP address
    }
    

    Following the example log entry we have been using, this should be the resulting entry:

    {
      "cid": QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ,
      "ip": 199.83.232.50,
      "continent": NA,
      "country": US,
      "region": [FL]
    }
    

    Note:

    In this dataset it is not useful to keep entries that carry no information, such as entries where no CID was found or no location was found. So one should filter out non-existing values.

    This step can be done when extracting the CIDs and the locations. For this, put the resulting data in a pandas dataframe, and use a NaN value for non-existing values in the dataset.

    This can then be filtered out easily with: df = df.dropna()
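
    As a sketch of the whole step, assuming the parsed log entries are already in a pandas dataframe df with the fields from above, and using the hypothetical extractCid and lookupLocation helpers sketched earlier:

    import pandas as pd

    #derive the CID from the http_host and target fields
    df['cid'] = df.apply(lambda r: extractCid(r['http_host'], r['target']), axis=1)

    #geo-locate the requester IP
    locations = df['ip'].apply(lookupLocation)
    df['continent'] = locations.apply(lambda t: t[0])
    df['country'] = locations.apply(lambda t: t[1])
    df['region'] = locations.apply(lambda t: t[2])

    #keep only the final fields and drop entries with missing values
    result = df[['cid', 'ip', 'continent', 'country', 'region']].dropna()
    result.to_csv('requested_cids.csv', index=False)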

Step 2. Find requested CIDs providers and related data.

  1. Extract unique CIDs from requested CIDs dataset.

    This can be easily done on python with pandas:

    import pandas as pd 
    
    # load requested cids dataset. 
    # add option of not keeping default na values, to not treat North America values (NA) as a Nan
    df = pd.read_csv('requested_cids.csv', keep_default_na=False)
    
    df = df[['cid']].drop_duplicates()
    
    df.to_csv('unique_cids.csv', index=False)
    
  2. Find the providers for these unique cids.

    To do this, we need to connect to the IPFS DHT and perform get-providers requests.

    To connect to the IPFS DHT we can write a simple go-libp2p program that uses the libp2p DHT implementation (go-libp2p-kad-dht), which connects to the IPFS DHT by default.

    The go-libp2p dht provides 2 main functions to get providers: the synchronous FindProviders and the asynchronous FindProvidersAsync.

    Notice that these functions have different return types. The synchronous version returns a []peer.AddrInfo, while the asynchronous version returns a <-chan peer.AddrInfo.

    A peer.AddrInfo represents a provider record that contains the provider's peer ID and its known multiaddresses.

    The way the code is structured (in go-libp2p-kad-dht v0.15.0) is that the synchronous FindProviders calls and waits for the asynchronous FindProvidersAsync function.

    The FindProviders function calls FindProvidersAsync as follows:

    func (dht *IpfsDHT) FindProviders(ctx context.Context, c cid.Cid) ([]peer.AddrInfo, error) {
    	...
    	var providers []peer.AddrInfo
    	for p := range dht.FindProvidersAsync(ctx, c, dht.bucketSize) {
    		providers = append(providers, p)
    	}
    	return providers, nil
    }
    

    By default, dht.bucketSize = 20, which means that the FindProviders function will try to find up to 20 providers for a CID.

    A simple Go program to fetch all (up to a maximum of 20) providers for a given CID would look like this:

    import (
    	"context"
    	"flag"
    	"fmt"

    	"github.com/ipfs/go-cid"
    	"github.com/libp2p/go-libp2p"
    	dht "github.com/libp2p/go-libp2p-kad-dht"
    )

    func main() {
    	cidStr := flag.String("cid", "", "CID to search")
    	flag.Parse()
    	//create a libp2p host (in some go-libp2p versions, New takes a context)
    	h, err := libp2p.New()
    	if err != nil {
    		panic(err)
    	}

    	kad, err := dht.New(context.Background(), h, dht.Mode(dht.ModeClient))
    	if err != nil {
    		panic(err)
    	}

    	//decode the CID string into a cid.Cid
    	c, err := cid.Decode(*cidStr)
    	if err != nil {
    		panic(err)
    	}

    	providers, err := kad.FindProviders(context.Background(), c)
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(providers)
    }
    

    To ease later processing of this program's output, the program should log easily parsable, or already structured, data.

    A possible way would be with a function such as:

    func logAnswer(a answer) {
    	if a.err != nil {
    		log.Println("Failed: ", a.cid, "err: ", a.err, " in peers: ", a.p, " time: ", a.dur)
    	} else {
    		log.Println("Found: ", a.cid, " in peers: ", a.p, " time: ", a.dur)
    	}
    }
    

    Where an answer has the following structure:

    type answer struct {
    	p   []peer.AddrInfo //the peer infos of the providers
    	cid cid.Cid         //the cid that was searched
    	dur time.Duration   //the duration of the request
    	err error           //the error, if any occurred, or nil
    }
    
  3. Process provider data.

    Having the logs of the previous program, the next step is to parse them, again with a simple python script. Assuming the logs have the format presented above, the parsing strategy is the following.

    Parsing strategy:

    1. Locate each field in the log. This can be easily done with a regex:

      "(.*) Found:  (.*)  in peers:  (.*)  time:  (.*)"
      
    2. Store each field in a variable.

    3. Peers is a list with the following format:

      [{<PeerID>: [<maddr1>, <maddr2>, ...]}, ... ]
      

      To process this use the following regex to get all entries:

      {(.*?): \[(.*?)\]}
      

      The resulting python code to process the providers is the following:

      import re #import for regex

      def parseProviders(peers):
          providers = []

          # find all matches of the regex
          for match in re.finditer(r'{(.*?): \[(.*?)\]}', peers):
              maddrs = []
              peer_id = match.group(1)
              addrs = match.group(2)
              if peer_id and addrs: #if both exist then process addrs
                  for addr in addrs.split(' '):
                      maddrs.append(addr)
              providers.append((peer_id, maddrs))

          return providers
      
    4. Extract the IP address from the multiaddress.

      The multiaddress format is the following:

      /<network protocol>/<network address>/<transport protocol>/<transport address>/...
      

      We are interested in the network address, whose interpretation depends on the network protocol field. The network protocols of interest are the following:

      • ip4
      • ip6
      • dns

      The following python function can extract the ip address from the multiaddress:

      import socket
      import numpy as np

      def extractIpsFromMaddr(maddr):
          try:
              splitted = maddr.split('/')
              proto = splitted[1]
              ip = splitted[2]
              if proto == 'ip4' or proto == 'ip6':
                  return ip
              elif 'dns' in proto:
                  try:
                      host = socket.gethostbyname(ip) #resolve the DNS name to an IP address
                  except socket.error:
                      return np.nan
                  return host
              else:
                  return np.nan #unsupported network protocol
          except Exception:
              return np.nan #malformed multiaddress
      
    5. Putting all together:

      Similarly to the first step, the final result should be a dataset that associates each requested CID to the IP address of each of its providers and to the location of that IP address:

      {
        "cid": the requested CID,
        "ip": the IP address of the CID provider,
        "continent": the continent associated with the IP address,
        "country": the country associated with the IP address,
        "region": the region associated with the IP address
      }
      

      The location of the provider can be extracted similarly to the first step using the MaxMind geo-ip database.
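
      A sketch of this assembly (records and the output file name are assumptions; parseProviders, extractIpsFromMaddr and the hypothetical lookupLocation helper come from the earlier sketches):

      import pandas as pd

      rows = []
      #records is assumed to be a list of (cid, providers) pairs, where
      #providers is the output of the parseProviders function above
      for cid, providers in records:
          for peer_id, maddrs in providers:
              for maddr in maddrs:
                  ip = extractIpsFromMaddr(maddr)
                  if not isinstance(ip, str): #skip maddrs with no usable IP
                      continue
                  continent, country, region = lookupLocation(ip)
                  rows.append({'cid': cid, 'ip': ip, 'continent': continent,
                               'country': country, 'region': region})

      #drop entries with missing values, as in the first step
      df = pd.DataFrame(rows).dropna()
      df.to_csv('provider_cids.csv', index=False)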

Step 3. Merge and Visualise data

  1. Merge requester data with provider data

    The resulting datasets from the previous steps can be viewed as two tables, Requesters and Providers, that share a common column: the CID.
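
    Merging the two tables on that column can be done with pandas (a sketch; the file names are the ones assumed in the earlier sketches):

    import pandas as pd

    requesters = pd.read_csv('requested_cids.csv', keep_default_na=False)
    providers = pd.read_csv('provider_cids.csv', keep_default_na=False)

    #join requester and provider rows that refer to the same CID
    merged = pd.merge(requesters, providers, on='cid', suffixes=('_req', '_prov'))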