Software used for solution:

Key: Bold - software, programs, libraries and dependencies needed

July 15, 2025

The rest of the day was spent researching how to get started on finding the top 20 IPs and installing GoAccess, an open-source terminal-based web log analyser. This first (GoAccess) approach aimed to maximise problem-solving skills while consciously being less reliant on ChatGPT. 3 hours were spent trying tools and dependencies on WSL, Ubuntu, Git Bash, and PowerShell; by then, this approach had become unfeasible.

July 16, 2025

Another approach was to use Docker and Better Stack (Better) for free. The learning curve was not steep; one can learn both within days. Better, at $500 monthly for a small team, tracks web traffic, but it is unsuitable for a quick solution: verifying every line for dashboard display could take too long. It would have taken 5 days with Git Bash to process each line in the log without memory overload. The app generated nothing, so I switched back to Python.

Top 20 Frequent IPs

# Libraries
import re
from collections import Counter

# From Stack Overflow
def get_ips(fname):
    # a pattern to match an IPv4 address at the start of a line
    ip_re = re.compile(r'^\s*(\d+\.\d+\.\d+\.\d+)')
    with open(fname, encoding="utf-8") as file:
        # for each line in the file
        for line in file:
            ip_match = ip_re.search(line)  # match the pattern against the line
            # extract the IP, or skip the line if it has no IP
            if ip_match is not None:
                # group(1) is the pattern in parentheses, the IP
                yield ip_match.group(1)
# Researched on W3Schools
def count_logs():
    # count the lines (log entries) in the sample log
    with open("./sample-log.log", "rt") as log:
        number_of_logs = len(log.readlines())
    print(number_of_logs)

count_logs()

ips = Counter(get_ips("sample-log.log"))
most_frequent_ips = ips.most_common(20) # gets the 20 most frequent IPs

print(most_frequent_ips)


There were 432,096 requests in sample-log.log in total. This meant that while the 4-day global traffic increase was not massive, CPU and bandwidth were still being wasted heavily on web crawling and cyberattacks. From experience, Python was easier than learning new applications for a short-term, high-priority task. Using the code above, I found the 20 most common client IPs.
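For readability, the Counter output can also be printed as a ranked table. A minimal sketch, reusing most_frequent_ips from the script above (the column widths are arbitrary):

# Print the top-20 result as a ranked table of rank, client IP and hit count,
# reusing most_frequent_ips from the script above.
print(f"{'Rank':<6}{'Client IP':<18}{'Hits':>8}")
for rank, (ip, hits) in enumerate(most_frequent_ips, start=1):
    print(f"{rank:<6}{ip:<18}{hits:>8}")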

Table

The greater the number of hits, the likelier the client IP belongs to a bot.

Notice the unusual difference in the magnitude of hits between #16 and #17; the bottom 4 IPs on the list can therefore safely be ruled out as bots.
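That cutoff can also be located programmatically. The sketch below, assuming most_frequent_ips from the earlier script, finds the rank after which the largest relative drop in hit counts occurs:

# Find the rank after which the largest relative drop in hit counts occurs,
# assuming most_frequent_ips from the earlier script.
def largest_drop(ranked):
    best_rank, best_ratio = None, float("inf")
    for i in range(len(ranked) - 1):
        _, hits_here = ranked[i]
        _, hits_next = ranked[i + 1]
        ratio = hits_next / hits_here  # a smaller ratio means a bigger drop
        if ratio < best_ratio:
            best_rank, best_ratio = i + 1, ratio
    return best_rank

print(largest_drop(most_frequent_ips))  # should print 16 if the #16/#17 gap is the largest drop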

20 most frequent IPs

Program

The required libraries were as expected; see the count_logs() function above, which prints the number of log lines in sample-log.log. After hours of coding a solution with ChatGPT and Stack Overflow, I made a FastAPI application that checked the log IPs against the top 16 IPs. The major downside was that the ASGI server failed immediately after the app was created, and it took 5 tries of uvicorn main_program:app --reload to get the app running.
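The FastAPI application itself is not reproduced here; the sketch below only illustrates the general idea under stated assumptions: the endpoint path and the placeholder addresses in TOP_16_IPS are illustrative, not the real results from the table above.

# A minimal sketch of the idea: expose an endpoint that reports whether a
# client IP is one of the 16 suspected bot IPs. The addresses below are
# placeholders, not the real results.
from fastapi import FastAPI

app = FastAPI()

TOP_16_IPS = {
    "192.0.2.1",     # placeholder (TEST-NET-1 range)
    "198.51.100.7",  # placeholder (TEST-NET-2 range)
}

@app.get("/check/{ip}")
def check_ip(ip: str):
    # membership test against the suspected bot IPs
    return {"ip": ip, "suspected_bot": ip in TOP_16_IPS}

Saved as main_program.py, a sketch like this would be started with uvicorn main_program:app --reload, matching the command above.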