Parse data from IPFS gateway logs into a structured format such as CSV.
IPFS gateway logs come in the form of nginx server logs, in the following format:
199.83.232.50 - - [2022-03-21T00:00:58+00:00] "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1" 200 50470 120 12.823 12.820 12.820 MISS "-" "-" *.i.ipfs.io ipfs.io https
<ip> <x> <y> <time> <request> <status> <body_bytes> <req_length> <request_time> <upstream_response_time> <upstream_header_time> <upstream_cache_status> <http_refer> <http_user_agent> <server_name> <http_host> <scheme>
Putting the above example into their respective fields we get:
ip: 199.83.232.50
x: -
y: -
time: [2022-03-21T00:00:58+00:00]
request: "GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1"
status: 200
body_bytes: 50470
req_length: 120
request_time: 12.823
upstream_response_time: 12.820
upstream_header_time: 12.820
upstream_cache_status: MISS
http_refer: "-"
http_user_agent: "-"
server_name: *.i.ipfs.io
http_host: ipfs.io
scheme: https
Each field has the following description:
ip: the ip address of the requester
x: unused identity field of the nginx log format (always "-")
y: the remote user (usually "-")
time: the time the request was issued
request: A string representing the original request
status: http status of the response
body_bytes: bytes returned to the requester
req_length: bytes of the request
request_time: time it took to resolve request in seconds
upstream_response_time: time to receive response from upstream server in seconds
upstream_header_time: time to receive response header from upstream server in seconds
upstream_cache_status: status of accessing the response cache (e.g., HIT, MISS)
http_refer: the HTTP Referer header of the request, i.e., the page that linked to the resource ("-" when absent)
http_user_agent: user agent used to perform http request (e.g., browser version)
server_name: name of the server that accepted the request
http_host: http host
scheme: request scheme (http or https)
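As an aside, the whole line can also be captured in one pass with a named-group regex. The pattern below is our own reconstruction from the example line and the field list above, not the official nginx log_format definition:

```python
import re

# Our own reconstruction of the log format as a named-group regex; field
# names follow the descriptions above (an assumption, not nginx's config).
LOG_RE = re.compile(
    r'(?P<ip>\S+) (?P<x>\S+) (?P<y>\S+) (?P<time>\[[^\]]+\]) '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes>\d+) '
    r'(?P<req_length>\d+) (?P<request_time>\S+) '
    r'(?P<upstream_response_time>[\d., ]+?) (?P<upstream_header_time>[\d., ]+?) '
    r'(?P<cache>\S+) "(?P<http_refer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'(?P<server_name>\S+) (?P<http_host>\S+) (?P<scheme>\S+)'
)

line = ('199.83.232.50 - - [2022-03-21T00:00:58+00:00] '
        '"GET /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png HTTP/1.1" '
        '200 50470 120 12.823 12.820 12.820 MISS "-" "-" *.i.ipfs.io ipfs.io https')
m = LOG_RE.match(line)
fields = m.groupdict()  # dict of field name -> string value
```

The token-counter strategy described later is more forgiving of the multi-valued upstream fields, but a single regex is convenient for quick checks.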
Parse each log entry into the following fields to obtain structured data:
{
"ip": the ip address of the requester,
"time": the time the request was issued,
"op": http operation (GET, POST, ...),
"target": target url for http operation,
"http": http version used,
"status": http status of the response,
"body_bytes": bytes returned to the requesters,
"request_length": bytes of the request,
"request_time": time it took to resolve request in seconds,
"upstream_response_time": time to receive response from upstream server in seconds,
"upstream_header_time": time to receive response header from upstream server in seconds,
"cache": status os accessing response cache,
"http_refer": http_refer,
"http_user_agent": user agent used to perform http request (e.g., browser version),
"server_name": name of the server that accepted the request,
"http_host": http_host,
"scheme": scheme
}
Parsing the above log entry will result in:
{
"ip": 199.83.232.50,
"time": [2022-03-21T00:00:58+00:00],
"op": GET,
"target": /ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png,
"http": HTTP/1.1,
"status": 200,
"body_bytes": 50470,
"request_length": 120,
"request_time": 12.823,
"upstream_response_time": 12.820,
"upstream_header_time": 12.820,
"cache": MISS,
"http_refer": -,
"http_user_agent": -,
"server_name": *.i.ipfs.io,
"http_host": ipfs.io,
"scheme": https
}
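Since the stated goal is a structured format such as CSV, the parsed entries can be written out with Python's standard csv module. This is a sketch; the column names follow the schema above:

```python
import csv
import io

# Column names follow the structured schema above.
FIELDS = ['ip', 'time', 'op', 'target', 'http', 'status', 'body_bytes',
          'request_length', 'request_time', 'upstream_response_time',
          'upstream_header_time', 'cache', 'http_refer', 'http_user_agent',
          'server_name', 'http_host', 'scheme']

entry = {'ip': '199.83.232.50', 'time': '[2022-03-21T00:00:58+00:00]',
         'op': 'GET',
         'target': '/ipfs/QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ/KittyCat3621.png',
         'http': 'HTTP/1.1', 'status': '200', 'body_bytes': '50470',
         'request_length': '120', 'request_time': '12.823',
         'upstream_response_time': '12.820', 'upstream_header_time': '12.820',
         'cache': 'MISS', 'http_refer': '-', 'http_user_agent': '-',
         'server_name': '*.i.ipfs.io', 'http_host': 'ipfs.io', 'scheme': 'https'}

buf = io.StringIO()  # with a real file: open('parsed.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(entry)
```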
Things to note:
Sometimes the fields upstream_response_time and upstream_header_time may have more than one value, separated by commas:
199.83.232.55 - - [2022-03-21T00:01:02+00:00] "GET /ipfs/QmPZLypREH8okGjFjSagUcbbrJhBJHhGsDVKfkZM62FBxs/5806.json HTTP/1.1" 200 727 113 61.430 60.009, 1.424 60.009, 1.424 MISS "-" "-" *.i.ipfs.io ipfs.io https
upstream_response_time: 60.009, 1.424
upstream_header_time: 60.009, 1.424
When this happens, it means that more than one upstream request was made.
http_refer and http_user_agent are string fields similar to the request field.
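As a small illustration of the multi-value note above, a comma-separated time field can be turned into a list of floats (the helper name is ours):

```python
def parse_times(raw):
    # "60.009, 1.424" -> [60.009, 1.424]; a single value yields a
    # one-element list, so downstream code can treat both cases the same
    return [float(tok) for tok in raw.split(',')]
```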
Parsing strategy:
Parse string fields first. String fields are enclosed in double quotes ("), so they can easily be extracted with a regex:
"(.*?)"
Remove string fields from log entry.
Split log entry by space and keep a token counter.
For upstream_response_time and upstream_header_time, advance the token counter while the token ends in a comma.
A simple Python program to parse a log entry would look like this:
import re #import for regex

matches = re.findall('"(.*?)"', logEntry) #find all quoted string fields
request = matches[0]
http_refer = matches[1]
http_user_agent = matches[2]
tokens = request.split(' ')
op = tokens[0]
target = tokens[1]
http = tokens[2]
entry = re.sub('"(.*?)"', '', logEntry) #remove the quoted string fields
tokens = entry.split(' ')
i = 0
ip = tokens[i]
i += 3
time = tokens[i]
i += 2 #jump over the empty token left by the removed request string
status = tokens[i]
i += 1
body_bytes = tokens[i]
i += 1
request_length = tokens[i]
i += 1
request_time = tokens[i]
i += 1
upstream_response_time, upstream_header_time = [], []
while tokens[i][-1] == ',':
    upstream_response_time.append(tokens[i][:-1]) #strip the trailing comma
    i += 1
upstream_response_time.append(tokens[i])
i += 1
while tokens[i][-1] == ',':
    upstream_header_time.append(tokens[i][:-1]) #strip the trailing comma
    i += 1
upstream_header_time.append(tokens[i])
i += 1
cache = tokens[i]
i += 3 #jump over the empty tokens left by the removed http_refer and http_user_agent
server_name = tokens[i]
i += 1
http_host = tokens[i]
i += 1
scheme = tokens[i].rstrip('\n') #strip the trailing newline if present
Note that there might be out-of-format lines! Wrap this code in a try/except block to deal with them.
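That guard can be sketched like this; parse_entry is a hypothetical stand-in for the parsing code above, reduced here so the sketch is self-contained:

```python
# parse_entry stands in for the token-counter parser above; here it only
# checks the token count and extracts the first field.
def parse_entry(line):
    tokens = line.split(' ')
    if len(tokens) < 17:  # a well-formed entry has at least 17 tokens
        raise ValueError('out of format line')
    return {'ip': tokens[0]}

parsed, skipped = [], 0
lines = ['199.83.232.50 - - truncated entry', 'x ' * 17]
for line in lines:
    try:
        parsed.append(parse_entry(line))
    except (ValueError, IndexError):
        skipped += 1  # out-of-format line: count it and move on
```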
Filter data.
To get the request pattern of IPFS, we need the requests that match the following restrictions: GET operations, successful (2xx) responses, and requests issued from remote (non-localhost) IP addresses.
We just need to apply these restrictions to the structured data we parsed before. This can easily be done with pandas, for example:
import pandas as pd #import for pandas
#load a csv file to a pandas dataframe
df = pd.read_csv(parsedLogFile, keep_default_na=False)
#filter for GET operations
df = df[df['op'] == 'GET']
#filter for successful operations
df = df.astype({'status': int})
df = df[(df['status'] >= 200) & (df['status'] < 300)]
#filter for remote requests
df = df[(df['ip'] != '127.0.0.1') & (df['ip'] != '::1')]
Extract CIDs from data.
We need to extract the CIDs from the data. CIDs can be usually found in the target field. However, there are instances in the logs, where the CID is the http_host field.
To this end, we need to join these two fields and search the CID in the resulting string.
CIDs can be found by searching the string for the known CID prefixes (Qm for CIDv0, baf for CIDv1):
import re #import for regex
import numpy as np

def extractCid(http_host, target):
    link = http_host + target
    cid = []
    cid.extend(re.findall('Qm\\w+', link))
    cid.extend(re.findall('baf\\w+', link))
    if len(cid) == 1:
        return cid[0]
    elif len(cid) == 0:
        return np.nan
    else:
        ## which cid should we return ? all ?
        ## return the first for now
        return cid[0]
Get geo-location of requester IP
We need to find the geo-location of requesters to map the location of requests. To this end, use a geo-ip database such as MaxMind.
To get this data, we use a python library: python-geoip-geolite2
from geoip import geolite2 #import geoip database

match = geolite2.lookup(ip)
if match is not None:
    continent = match.continent #returns the continent
    country = match.country #returns the country
    regions = match.subdivisions #returns a list of the regions
else:
    pass ## handle the case when there is no match
The lookup might not always return a full value: it can happen that only the continent is known for an IP address, and most matches will not have regions. Regions are useful for large countries, such as the USA, where each region encodes a state.
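That fallback logic can be sketched as a small helper (the function name is ours; it takes whatever geolite2.lookup() returned and uses np.nan for the parts the database does not know):

```python
import numpy as np

def location_fields(match):
    """Extract (continent, country, region) from a geoip lookup result,
    falling back to np.nan for unknown parts. `match` is assumed to have
    the continent/country/subdivisions attributes of a geolite2 match."""
    if match is None:
        return (np.nan, np.nan, np.nan)
    continent = match.continent if match.continent else np.nan
    country = match.country if match.country else np.nan
    regions = list(match.subdivisions) if match.subdivisions else []
    region = regions[0] if regions else np.nan  # most matches have no regions
    return (continent, country, region)
```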
Putting it all together.
The final result should be a dataset that associates each requested CID to an IP address and to the location of that IP address:
{
"cid": the requested CID
"ip": the IP address from where the request was made
"continent": the continent associated to the IP address
"country": the country associated to the IP address
"region": the region associated to the IP address
}
Following the example log entry we have been using, this should be the resulting entry:
{
"cid": QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ
"ip": 199.83.232.55
"continent": NA
"country": US
"region": [FL]
}
Note:
In this dataset it is not useful to keep entries that carry no information, such as entries where no CID or no location was found, so non-existing values should be filtered out.
This step can be done when extracting the CIDs and the locations. For this, put the resulting data in a pandas dataframe, and put a NaN value for non-existing values in the dataset.
This can then be filtered out easily with: df = df.dropna()
Extract unique CIDs from requested CIDs dataset.
This can be easily done on python with pandas:
import pandas as pd
# load requested cids dataset.
# add option of not keeping default na values, to not treat North America values (NA) as a Nan
df = pd.read_csv('requested_cids.csv', keep_default_na=False)
df = df[['cid']].drop_duplicates()
df.to_csv('unique_cids.csv', index=False)
Find the providers for these unique cids.
To do this, we need to connect to the IPFS DHT and perform get provider requests.
To connect to IPFS DHT we can write a simple go-libp2p program that uses the libp2p dht. The go-libp2p dht connects by default to the IPFS DHT.
The go-libp2p dht provides 2 main functions to get providers:
FindProviders(ctx context.Context, c cid.Cid) ([]peer.AddrInfo, error)
FindProvidersAsync(ctx context.Context, key cid.Cid, count int) <-chan peer.AddrInfo
Notice that these functions have different return types. The synchronous version returns a []peer.AddrInfo, while the asynchronous version returns a <-chan peer.AddrInfo.
A peer.AddrInfo represents a provider record that contains:
Addrs []ma.Multiaddr
ID ID
The way the code is structured (in go-libp2p-kad-dht v0.15.0) is that the synchronous FindProviders calls and waits for the asynchronous FindProvidersAsync function, as such:
func (dht *IpfsDHT) FindProviders(ctx context.Context, c cid.Cid) ([]peer.AddrInfo, error) {
	...
	var providers []peer.AddrInfo
	for p := range dht.FindProvidersAsync(ctx, c, dht.bucketSize) {
		providers = append(providers, p)
	}
	return providers, nil
}
By default dht.bucketSize = 20, which means that the FindProviders function will try to find up to 20 providers for a CID.
A simple program in Go to fetch all (up to a maximum of 20) providers for a given CID would look like this:
import (
	"context"
	"flag"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	cidStr := flag.String("cid", "", "CID to search")
	flag.Parse()
	c, err := cid.Decode(*cidStr) //parse the CID string into a cid.Cid
	if err != nil {
		panic(err)
	}
	h, err := libp2p.New() //create a libp2p host
	if err != nil {
		panic(err)
	}
	kad, err := dht.New(context.Background(), h, dht.Mode(dht.ModeClient))
	if err != nil {
		panic(err)
	}
	//note: the host must first connect to bootstrap peers for lookups to succeed
	providers, err := kad.FindProviders(context.Background(), c)
	if err != nil {
		panic(err)
	}
	fmt.Println(providers)
}
To later ease the data processing of the output of this program, the program should log easily parsable data, or already structured data.
A possible way would be with a function such as:
func logAnswer(a answer) {
	if a.err != nil {
		log.Println("Failed: ", a.cid, " err: ", a.err, " in peers: ", a.p, " time: ", a.dur)
	} else {
		log.Println("Found: ", a.cid, " in peers: ", a.p, " time: ", a.dur)
	}
}
Where an answer has the following structure:
type answer struct {
	p   []peer.AddrInfo //the peer infos of the providers
	cid cid2.Cid        //the cid that was searched
	dur time.Duration   //the duration of the request
	err error           //the error, if any occurred, or nil
}
Process provider data.
Having the logs of the previous program, parse the logs. One way to do this is again with a simple python script. Assuming the logs have format presented above, the parsing strategy is the following.
Parsing strategy:
Locate each field in the log. This can be easily done with a regex:
"(.*) Found: (.*) in peers: (.*) time: (.*)"
Store each field in a variable.
Peers is a list with the following format:
[{<PeerID>: [<maddr1>, <maddr2>, ...]}, ... ]
To process this use the following regex to get all entries:
{(.*?): \[(.*?)\]}
The resulting Python code to process the providers is the following:
import re #import for regex

def parseProviders(peers):
    providers = []
    # find all matches of the regex
    for match in re.finditer('{(.*?): \\[(.*?)\\]}', peers):
        id = match.group(1)
        addrs = match.group(2)
        if id and addrs: #if both exist then process addrs
            maddrs = []
            for addr in addrs.split(' '):
                maddrs.append(addr)
            providers.append((id, maddrs))
    return providers
Extract the IP address from the multiaddress.
The multiaddress format is the following:
/<network protocol>/<network address>/<transport protocol>/<transport address>/...
We are interested in the network address, which depends on the network protocol field. The network protocols of interest are ip4, ip6, and the dns variants (dns, dns4, dns6).
The following Python function can extract the IP address from the multiaddress:
import socket
import numpy as np

def extractIpsFromMaddr(maddr):
    try:
        splitted = maddr.split('/')
        proto = splitted[1]
        ip = splitted[2]
        if proto == 'ip4' or proto == 'ip6':
            return ip
        elif 'dns' in proto:
            try:
                host = socket.gethostbyname(ip) #resolve the hostname to an ip
            except socket.error:
                return np.nan
            return host
        else:
            return np.nan #unhandled network protocol
    except IndexError:
        return np.nan #malformed multiaddress
Putting it all together:
Similarly to the first step, the final result should be a dataset that associates each requested CID to an IP address and to the location of that IP address:
{
"cid": the requested CID
"ip": the IP address of the CID provider
"continent": the continent associated to the IP address
"country": the country associated to the IP address
"region": the region associated to the IP address
}
The location of the provider can be extracted similarly to the first step using the MaxMind geo-ip database.
Merge requester data with provider data
The resulting datasets from the previous steps can be viewed as two tables that share a common value, the cid:
Requesters | Providers
---|---
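The join itself is a one-liner with pandas. A sketch with toy data: the column names follow the datasets from the previous steps, and the provider row (203.0.113.7, EU, DE, BY) is made up purely for illustration:

```python
import pandas as pd

# Toy versions of the requester and provider datasets; the provider
# values are invented for this example.
requesters = pd.DataFrame({
    'cid': ['QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ'],
    'ip': ['199.83.232.50'],
    'continent': ['NA'], 'country': ['US'], 'region': ['FL'],
})
providers = pd.DataFrame({
    'cid': ['QmPvt7yHLGpVhd4jVFX6reEZZ34fQRepQ1a1QTsZQBH1hJ'],
    'ip': ['203.0.113.7'],
    'continent': ['EU'], 'country': ['DE'], 'region': ['BY'],
})
# inner join on the shared cid column; the suffixes keep requester and
# provider columns apart in the merged table
merged = requesters.merge(providers, on='cid', suffixes=('_req', '_prov'))
```

An inner join keeps only CIDs that appear in both tables, i.e., requested CIDs for which at least one provider was found.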