Step 1. Extract requested CIDs and related data.

  1. Parse data from IPFS gateways into a structured data format, such as CSV.

    IPFS gateway logs come in the form of nginx server logs, in the following format: - - [2022-03-21T00:00:58+00:00] "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1" 200 50470 120 12.823 12.820 12.820 MISS "-" "-" * https
    <ip> <x> <y> <time> <request> <status> <body_bytes> <req_length> <request_time> <upstream_response_time> <upstream_header_time> <upstream_cache_status> <http_refer> <http_user_agent> <server_name> <http_host> <scheme>

    Putting the above example into their respective fields we get:

    x: -
    y: -
    time: [2022-03-21T00:00:58+00:00]
    request: "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1"
    status: 200
    body_bytes: 50470
    req_length: 120
    request_time: 12.823
    upstream_response_time: 12.820
    upstream_header_time: 12.820
    upstream_cache_status: MISS
    http_refer: "-"
    http_user_agent: "-"
    server_name: *
    scheme: https

    Each field has the following description:

    ip: the ip address of the requester
    x: -
    y: -
    time: the time the request was issued
    request: A string representing the original request
    status: http status of the response
    body_bytes: bytes returned to the requester
    req_length: bytes of the request
    request_time: time it took to resolve request in seconds
    upstream_response_time: time to receive response from upstream server in seconds
    upstream_header_time: time to receive response header from upstream server in seconds
    upstream_cache_status: status of accessing the response cache
    http_refer: the HTTP Referer header sent with the request (nginx's $http_referer), often empty
    http_user_agent: user agent used to perform http request (e.g., browser version)
    server_name: name of the server that accepted the request
    http_host: http host
    scheme: request scheme (http or https)

    Parse log entry into these fields to get structured data:

    	"ip": the ip address of the requester,
    	"time": the time the request was issued, 
    	"op": http operation (GET, POST, ...),
    	"target": target url for http operation,
    	"http": http version used,
      "status": http status of the response,
      "body_bytes": bytes returned to the requesters,
      "request_length": bytes of the request,
      "request_time": time it took to resolve request in seconds,
      "upstream_response_time": time to receive response from upstream server in seconds,
      "upstream_header_time": time to receive response header from upstream server in seconds,
      "cache": status os accessing response cache,
      "http_refer": http_refer,
      "http_user_agent": user agent used to perform http request (e.g., browser version),
      "server_name": name of the server that accepted the request,
      "http_host": http_host,
      "scheme": scheme

    Parsing the above request will result in:

    	"time": [2022-03-21T00:00:58+00:00], 
    	"op": GET,
    	"target": /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png,
    	"http": HTTP/1.1,
      "status": 200,
      "body_bytes": 50470,
      "request_length": 120,
      "request_time": 12.823,
      "upstream_response_time": 12.820,
      "upstream_header_time": 12.820,
      "cache": MISS,
      "http_refer": -,
      "http_user_agent": -,
      "server_name": *,
      "scheme": https

    Things to note:

    1. Sometimes the fields upstream_response_time and upstream_header_time may have more than one value, separated by commas: - - [2022-03-21T00:01:02+00:00] "GET /ipfs/QmPZLypREH8okGjFjSagUcbbrJhBJHhGsDVKfkZM62FBxs/5806.json HTTP/1.1" 200 727 113 61.430 60.009, 1.424 60.009, 1.424 MISS "-" "-" * https
      upstream_response_time: 60.009, 1.424
      upstream_header_time: 60.009, 1.424

      When this happens, it means that more than one upstream request was made.

    2. http_refer and http_user_agent are string fields similar to the request field.
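    The multi-valued case from note 1 can be normalized with a small helper that splits the field on commas; the name parse_upstream is an illustrative choice:

```python
def parse_upstream(field):
    # '60.009, 1.424' -> [60.009, 1.424]; '12.820' -> [12.82]
    return [float(v) for v in field.replace(',', ' ').split()]
```

    This yields one float per upstream request, whether the field holds a single value or several.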

    Parsing strategy:

    1. Parse string fields. String fields are enclosed by " and can easily be extracted with a regex.

    2. Remove string fields from log entry.

    3. Split log entry by space and keep a token counter.

    4. For upstream_response_time and upstream_header_time, advance the token counter while the end of the token is a comma.

    A simple Python program to parse a log entry would look like this:

    import re  # regular expressions, used for the quoted string fields

    matches = re.findall('"(.*?)"', logEntry)  # extract the three quoted string fields
    request = matches[0]
    http_refer = matches[1]
    http_user_agent = matches[2]
    tokens = request.split(' ')
    op = tokens[0]
    target = tokens[1]
    http = tokens[2]
    entry = re.sub('"(.*?)"', '', logEntry)  # remove the string fields from the entry
    tokens = entry.split(' ')
    i = 0
    ip = tokens[i]
    i += 3
    time = tokens[i]
    i += 2  # skip the empty token left where the request string was removed
    status = tokens[i]
    i += 1
    body_bytes = tokens[i]
    i += 1
    request_length = tokens[i]
    i += 1
    request_time = tokens[i]
    i += 1
    upstream_response_time, upstream_header_time = [], []
    while tokens[i][-1] == ',':  # multiple upstream requests: collect every value
        upstream_response_time.append(tokens[i][:-1])
        i += 1
    upstream_response_time.append(tokens[i])
    i += 1
    while tokens[i][-1] == ',':
        upstream_header_time.append(tokens[i][:-1])
        i += 1
    upstream_header_time.append(tokens[i])
    i += 1
    cache = tokens[i]
    i += 3  # skip the empty tokens left where http_refer and http_user_agent were removed
    server_name = tokens[i]
    i += 1
    http_host = tokens[i]
    i += 1
    scheme = tokens[i].rstrip('\n')  # strip the trailing newline, if present

    Note that there might be out-of-format lines! Wrap this code in a try/except block to deal with them.
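    A sketch of that guard, assuming the parsing code above is wrapped in a function (here passed in as parse_entry, a hypothetical name) that raises IndexError or ValueError on malformed lines:

```python
def parse_lines(lines, parse_entry):
    # parse each line; skip and count lines that do not match the format
    records, skipped = [], 0
    for line in lines:
        try:
            records.append(parse_entry(line))
        except (IndexError, ValueError):
            skipped += 1  # out-of-format line: count it and move on
    return records, skipped
```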

  2. Filter data.

    To get the request pattern of IPFS gateway users, we keep only the requests that match the following restrictions: successful (2xx) GET requests issued by remote clients.

    We just need to apply these restrictions to the structured data we parsed before. This filtering can easily be done with pandas, for example:

    import pandas as pd #import for pandas
    #load a csv file to a pandas dataframe
    df = pd.read_csv(parsedLogFile, keep_default_na=False)
    #filter for GET operations
    df = df[df['op'] == 'GET']
    #filter for successful operations
    df = df.astype({'status': int})
    df = df[(df['status'] >= 200) & (df['status'] < 300)]
    #filter for remote requests
    df = df[(df['ip'] != '') & (df['ip'] != '::1')]
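    The effect of these filters can be sanity-checked on a small in-memory DataFrame; the sample rows below are made up for illustration:

```python
import pandas as pd

rows = [
    {'op': 'GET',  'status': '200', 'ip': '1.2.3.4'},  # kept
    {'op': 'POST', 'status': '200', 'ip': '1.2.3.4'},  # dropped: not a GET
    {'op': 'GET',  'status': '404', 'ip': '1.2.3.4'},  # dropped: not 2xx
    {'op': 'GET',  'status': '200', 'ip': '::1'},      # dropped: local request
]
df = pd.DataFrame(rows)
df = df[df['op'] == 'GET']
df = df.astype({'status': int})
df = df[(df['status'] >= 200) & (df['status'] < 300)]
df = df[(df['ip'] != '') & (df['ip'] != '::1')]
print(len(df))  # 1
```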
  3. Extract CIDs from data.

    We need to extract the CIDs from the data. CIDs can usually be found in the target field. However, there are instances in the logs where the CID is in the http_host field.

    To this end, we join these two fields and search for the CID in the resulting string.

    CIDs can be found by first splitting the string into words and then searching for a known starting substring:
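    A sketch of that search, assuming we look only for CIDv0 strings, which start with "Qm" as in the log examples above; the function name and the separators handled are illustrative assumptions:

```python
def extract_cid(target, http_host):
    # join both fields, split on path and host separators,
    # and return the first token that looks like a CIDv0
    joined = target + '/' + http_host
    for token in joined.replace('.', '/').split('/'):
        if token.startswith('Qm'):  # CIDv0 prefix; extend for other CID versions
            return token
    return None
```

    For the example request above, extract_cid('/ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png', '') returns the CID, while a target without a CID returns None.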