Key: Bold - software, programs, libraries and dependencies needed
July 15, 2025
The rest of the day was spent researching how to find the top 20 IPs and installing GoAccess, an open-source, terminal-based web log analyser. This first (GoAccess) approach aimed to maximise problem-solving skills while consciously relying less on ChatGPT. Three hours were spent trying tools and dependencies on WSL, Ubuntu, Git Bash, and PowerShell; by then, this approach had become unfeasible.
July 16, 2025
Another option was to use Docker and Better Stack (Better) for free. The learning curve was not steep; one can learn both within days. Better, which costs $500 monthly for a small team, tracks web traffic, but it is unsuitable as a quick solution because it could take too long to verify every line for dashboard display. Processing each line of the log with Git Bash without memory overload would have taken 5 days. The app generated nothing, so I switched back to Python.
# Libraries
import re
from collections import Counter


# From Stack Overflow
def get_ips(fname):
    # a pattern to match an IPv4 address at the start of a line
    ip_re = re.compile(r'^\s*(\d+\.\d+\.\d+\.\d+)')
    with open(fname, encoding="utf-8") as file:
        # for each line in the file
        for line in file:
            ip_match = ip_re.search(line)  # pass the line
            # extract the IP, or ignore the line if it does not have an IP
            if ip_match is not None:
                # group(1) is the pattern in parentheses, the IP.
                yield ip_match.group(1)


# Researched W3Schools
def count_logs():
    # read every line of the log and print how many there are
    with open("./sample-log.log", "rt") as log:
        number_of_logs = len(log.readlines())
    print(number_of_logs)


count_logs()
ips = Counter(get_ips("sample-log.log"))
most_frequent_ips = ips.most_common(20)  # gets the 20 most frequent IPs
print(most_frequent_ips)
There were 432096 requests in sample-log.log in total. This meant that while the 4-day global traffic increase was not massive, CPU and bandwidth were still heavily wasted on web crawling and cyberattacks. From experience, writing Python was easier than learning new applications for a short-term, high-priority task. Using the code above, I found the 20 most common client IPs.
The greater the number of hits, the likelier it is that the client IP belongs to a bot.
Notice the unusual difference in the magnitude of hits between #16 and #17. The bottom 4 of the list can therefore safely be ruled out as bots.
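The cut-off between two neighbouring ranks can also be located programmatically. The following is a minimal sketch, assuming a list of (ip, hits) pairs in the shape returned by Counter.most_common. The helper name largest_drop_index and the sample counts are hypothetical (they are not taken from sample-log.log), and comparing the ratio of adjacent hit counts is only one possible way to define the gap.

```python
def largest_drop_index(ranked):
    """Return the index i where the drop from ranked[i] to ranked[i + 1] is largest.

    `ranked` is a list of (ip, hits) pairs sorted by hits in descending order.
    The drop is measured as a ratio, so a fall from 8000 to 400 counts for
    more than a fall from 9000 to 8500.
    """
    if len(ranked) < 2:
        raise ValueError("need at least two entries to compare")
    best_index, best_ratio = 0, 0.0
    for i in range(len(ranked) - 1):
        current_hits = ranked[i][1]
        next_hits = ranked[i + 1][1]
        ratio = current_hits / max(next_hits, 1)  # guard against division by zero
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index


# Hypothetical counts using documentation-range addresses, not real log data.
sample = [
    ("192.0.2.1", 9000),
    ("192.0.2.2", 8500),
    ("192.0.2.3", 8000),
    ("192.0.2.4", 400),
    ("192.0.2.5", 390),
]
cut = largest_drop_index(sample)
suspected_bots = [ip for ip, _ in sample[: cut + 1]]
print(suspected_bots)
```

Everything up to and including the returned index sits above the gap; entries after it would be the ones ruled out as bots.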
Libraries were needed, as expected; see the count_logs() function, which prints the number of log lines in sample-log.log. After hours of coding a solution with ChatGPT and Stack Overflow, I made a FastAPI application that checked the log IPs against the top 16 IPs. The major downside was that the ASGI server failed immediately after the app was created, and it took 5 tries of uvicorn main_program:app --reload to run the app.
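The core of such a check can be sketched without any web framework. The example below is not the FastAPI application itself; it only illustrates the lookup step, assuming each log line starts with the client IPv4 address (the same assumption the regular expression in get_ips makes). The function name flag_lines, the sample log lines, and the flagged address are all hypothetical.

```python
import re

# Same assumption as get_ips: the IPv4 address is at the start of the line.
IP_RE = re.compile(r'^\s*(\d+\.\d+\.\d+\.\d+)')


def flag_lines(lines, flagged_ips):
    """Yield (ip, is_flagged) for every line that starts with an IPv4 address."""
    flagged = set(flagged_ips)  # a set keeps each membership test cheap
    for line in lines:
        match = IP_RE.search(line)
        if match is not None:
            ip = match.group(1)
            yield ip, ip in flagged


# Hypothetical log lines using documentation-range addresses.
sample_lines = [
    '192.0.2.1 - - [15/Jul/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.7 - - [15/Jul/2025:10:00:01 +0000] "GET /about HTTP/1.1" 200 1024',
    'malformed line without an address',
]
results = list(flag_lines(sample_lines, ["192.0.2.1"]))
print(results)
```

In a web application, a function like this could sit behind a route handler, with the flagged set built once at startup from the most frequent IPs.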