This page lists ideas for student projects for MFF UK. If you're interested in any of the projects, have your own idea, or would like to try an internship with us, please write to email@example.com.
Note that this page is accessible at https://apify.com/matfyz
We're on a mission to make the web more programmable. Apify provides cloud infrastructure and tools that let people automate anything a person can do manually in a web browser, and run it at scale. Our systems process billions of web pages and hundreds of terabytes of data every month. Our stack is based on AWS, Linux, Node.js, MongoDB, and dozens of other services.
Apify was founded in 2016 by two friends who met during their studies at MFF UK. Currently, we're about 40 people, based in an office in Prague's Lucerna palace. About 20% of the company is from Matfyz, so we 🙂
Learn more at apify.com/about or https://apify.com/jobs
TL;DR: Use machine learning to find the optimal strategy for rotating IP addresses in order to maximize data extraction potential from websites.
To succeed in the modern digital economy, companies need to be able to use data in order to make informed decisions, build better products or improve their sales and marketing. And the largest source of data ever created in the history of mankind is the web. The process of extracting structured data from unstructured websites is commonly known as web scraping.
Many websites employ technical protections to prevent access from automated systems and/or the downloading of large amounts of data. One of the simplest and most efficient measures used by websites is blocking based on the IP address of the client. To work around such blocking, web scraping and crawling systems employ pools of proxy servers and thus access websites from many different IP addresses. The websites, in turn, look for access patterns and attempt to detect the use of proxies.
The goal of this project is to build a machine learning-based system that takes feedback from crawling bots telling it whether a specific target website has blocked a specific IP address. The system would integrate with Apify Proxy (https://apify.com/proxy) and strive to learn the optimal strategy for rotating IP addresses in order to reduce the chance of “burning” the proxies, minimize blocking by target websites and provide the highest long-term throughput over the proxies. Apify Proxy is developed in Node.js on top of the proxy-chain NPM package (https://www.npmjs.com/package/proxy-chain).
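To make the feedback loop concrete, here is a minimal sketch of one possible starting point. It treats IP selection as a multi-armed bandit and uses Thompson sampling. It is written in Python purely for illustration, and the `ProxyRotator` class, its methods, and the IP and website names are all hypothetical. None of it is part of Apify Proxy or proxy-chain. The sketch also assumes that block rates do not change over time, which a real system could not assume, since blocks expire and heavily used IPs get "burned".

```python
import random
from collections import defaultdict


class ProxyRotator:
    """Hypothetical Thompson-sampling rotator (illustration only).

    Keeps one Beta(successes + 1, blocks + 1) posterior per (website, IP) pair.
    """

    def __init__(self, ips, seed=None):
        self.ips = list(ips)
        self.rng = random.Random(seed)
        # stats[(website, ip)] = [successes, blocks]
        self.stats = defaultdict(lambda: [0, 0])

    def pick_ip(self, website):
        """Sample a plausible success rate for every IP and return the best draw."""
        def draw(ip):
            ok, blocked = self.stats[(website, ip)]
            return self.rng.betavariate(ok + 1, blocked + 1)

        return max(self.ips, key=draw)

    def report(self, website, ip, blocked):
        """Feedback from a crawling bot: was this request blocked?"""
        self.stats[(website, ip)][1 if blocked else 0] += 1

    def success_rate(self, website, ip):
        """Posterior mean of the success probability for this pair."""
        ok, blocked = self.stats[(website, ip)]
        return (ok + 1) / (ok + blocked + 2)


if __name__ == "__main__":
    # Simulated feedback: one IP is always blocked, the other never is.
    rotator = ProxyRotator(["10.0.0.1", "10.0.0.2"], seed=0)
    for _ in range(100):
        ip = rotator.pick_ip("example.com")
        rotator.report("example.com", ip, blocked=(ip == "10.0.0.1"))
    for ip in rotator.ips:
        print(ip, round(rotator.success_rate("example.com", ip), 2))
```

Sampling from the posterior, rather than always picking the IP with the best observed rate, keeps some exploration alive, so an IP that was blocked early still gets retried occasionally. The actual project would need to go further, for example by modelling request rate, cool-down periods and correlations between IPs in the same subnet.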
TL;DR: Many web pages have a common structure, e.g. news articles, product pages or job offers. Build a tool to automatically extract structured content from these pages, and thus enable its semantic analysis.
To succeed in the modern digital economy, companies need to be able to use data in order to make informed decisions, build better products or improve their sales and marketing. And the largest source of data created in the history of mankind is the web. The process of extracting structured data from unstructured websites is commonly known as web scraping.
Since websites, in general, have a diverse structure, one typically needs to manually set up a web scraper tailored for specific web pages, which is costly and doesn’t scale to a large number of sites. On the other hand, there are certain types of web pages that have a very similar structure, such as product pages, news articles or job offers. It might be possible to employ modern AI techniques to automatically extract structured data from these common types of web pages.
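As a point of reference for what "structured data" could mean here, the following sketch shows a deliberately naive, non-AI baseline for one page type, a news article. It is an assumption made for illustration, not the method the project prescribes. The `Article` schema, the `extract_article` function and the `min_words` threshold are hypothetical, and only Python's standard `html.parser` module is used. A learned extractor would be expected to beat this kind of heuristic on real pages.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Article:
    """Hypothetical target schema for an extracted news article."""
    title: str
    body: str


class _TextCollector(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def extract_article(html, min_words=5):
    """Heuristic baseline: keep paragraphs long enough to look like running text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Normalize whitespace, then drop short fragments such as menu items.
    cleaned = [" ".join(p.split()) for p in parser.paragraphs]
    body = [p for p in cleaned if len(p.split()) >= min_words]
    return Article(title=parser.title.strip(), body="\n".join(body))
```

A call such as `extract_article(page_html)` returns an `Article` whose `body` contains only the longer paragraphs. The fixed word-count threshold is exactly the sort of hand-tuned rule that does not transfer well across sites, which is the gap the project aims to close.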