Hey everyone! I've been selected for GSoC 2020, and a great summer full of open source development lies ahead. This is my first weekly check-in, where I share my experience of stepping into the program.

I'm Satyam, and this summer I've been selected for GSoC with Kiwix. I am working on a project titled "Improve Python Scrapers". Kiwix is an organization that basically aims to archive the internet. Yup! You read that right - ARCHIVE THE INTERNET. They develop scrapers to download the best content from the internet and make ZIMs out of it. Now, you might be wondering, what's a ZIM? It's essentially a special file format that stores content from the internet as "articles", with support for images, videos, and much more. I came across this awesome organization while skimming through the GSoC organization list, and when I dug deeper, man, I knew this was the one I was gonna work with.

I'm working in the openZIM organization on GitHub, which is maintained by Kiwix and controls the ZIM format specification and all the scrapers used to create ZIM files.

Okay, that seems to be enough of an introduction. Let's dive into the details.

What did I do this week?

So, this week started with a discussion with my mentor Renaud, where we quickly went over a few small things regarding the TED scraper and zimscraperlib. We needed a video module in zimscraperlib and some enhancements to the TED scraper, which I rewrote during the previous week (yeah, I contributed to Kiwix before the program officially started). Specifically, we needed an S3-based optimization cache, support for WebM on platforms that don't support it natively (Hey Apple, I'm looking at you), and removal of the JavaScript dependencies from the repo.
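To give a rough idea of what an optimization cache does, here's a minimal sketch. It is not the actual TED scraper code: the function names `cache_key` and `get_optimized`, the `preset` parameter, and the key layout are all hypothetical, and a plain dictionary stands in for the S3 bucket. The point is simply that a video is only re-encoded when no optimized copy is already stored under its key.

```python
import hashlib
from typing import Callable, Dict


def cache_key(source_url: str, preset: str) -> str:
    """Build a stable key from the source URL and the encoding preset.

    Hypothetical key layout, assumed for illustration only.
    """
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()
    return f"{preset}/{digest}"


def get_optimized(
    source_url: str,
    preset: str,
    store: Dict[str, bytes],
    encode: Callable[[str, str], bytes],
) -> bytes:
    """Return the optimized file, encoding it only on a cache miss.

    `store` is a dict standing in for a remote bucket (assumption);
    `encode` is whatever function performs the expensive re-encoding.
    """
    key = cache_key(source_url, preset)
    if key in store:
        # Cache hit: reuse the previously optimized file.
        return store[key]
    # Cache miss: do the expensive work once and remember the result.
    data = encode(source_url, preset)
    store[key] = data
    return data
```

With a real bucket, the dictionary lookup and assignment would become a download and an upload, but the hit-or-miss logic would stay the same.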

The full offline Wikipedia weighs merely 80GB, and that's with pictures. The ZIM format is incredible!

Talking about what has been done so far, we've already added the three aforementioned enhancements and are now working on adding audio-language-based filtering to the TED scraper. For zimscraperlib, I've created a draft PR for the video module, and hopefully it'll make it to the master branch in the coming days.

What's ahead now?

After I complete the video module in zimscraperlib and audio-based filtering in the TED scraper, the next step will be to support localization in the TED scraper, add a batch ZIM creation script to the YouTube scraper, and work on various other bugs/fixes/enhancements. I'll also most probably have a co-programming session with my mentor, and hopefully we'll release version 2.0 of the TED scraper.

Challenges I faced

There are challenges in everything and yes, I faced them too. The most challenging thing I did, I think, was to completely understand how the TED language system works, as there seem to be many different languages and locale formats in use. Anyway, it was fun understanding these and going through all the languages that TED has content available in.
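To show the kind of mess I mean, here's a small sketch of how differently written language codes could be brought into one shape. This is my own illustration and not how TED or the scraper actually handles it: the function name `normalize_lang_code` is hypothetical, and the sample codes are assumed examples rather than values taken from TED.

```python
def normalize_lang_code(code: str) -> str:
    """Rewrite codes like 'pt_br' or 'ZH-hans' as 'pt-BR' and 'zh-Hans'.

    Illustrative only: lowercases the language part, title-cases
    four-letter script parts, and uppercases two-letter region parts.
    """
    parts = code.strip().replace("_", "-").split("-")
    normalized = [parts[0].lower()]
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            normalized.append(part.title())  # script, e.g. Hans
        elif len(part) == 2 and part.isalpha():
            normalized.append(part.upper())  # region, e.g. BR
        else:
            normalized.append(part.lower())  # anything else is left lowercase
    return "-".join(normalized)


if __name__ == "__main__":
    for raw in ["pt_br", "PT-BR", "zh-hans", "en"]:
        print(raw, "->", normalize_lang_code(raw))
```

Once every code has the same shape, comparing or filtering by language becomes a simple string comparison.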

Verdict

Let me tell you, GSoC is an experience to experience the awesome work that goes into Open Source development, and I'd recommend that everyone try and experience it once in their lifetime. It actually takes you on an amazing journey where you experience self-transformation, interact with awesome people, and most importantly, do cool stuff.

Ah, maybe I put a lot of "experience" in the 3rd last line.