This is a blog-post version of my Capstone project, which aims to improve information literacy on Wikipedia using Machine Learning. I wrote this post for non-technical readers.

There’s a problem

When’s the last time you checked out a Wikipedia page?

It’s probably today, because even if you didn’t actively look for Wikipedia, it shows up in Google search as a Knowledge Panel or Rich Result. 60% of the time, Google will show Wikipedia text in its highlighted search results.

Knowledge Panel and Answer Box in a Google search result. Source

I first learned about Wikipedia’s influence on public opinion in a Public History class. Duh, what’s the first place you’ll go if your teacher asks you to write about the Vietnam War? Wikipedia!

Let’s compare two entries from two different language versions. I searched for “terrorism” and “khủng bố” (“terrorism” in Vietnamese) to see how the term is used in the English and Vietnamese versions.

Screenshot: excerpt from the Vietnamese Wikipedia entry for “khủng bố”

Translation: “Phoenix operations were carried out mostly through terrorism and assassination. Black-clad Phoenix agents, trained by American intelligence, were sent to villages to gather intelligence, capture anyone suspected of being a Communist or a Communist sympathizer, and torture them for information.”

The English version uses “terrorism” for North Vietnamese/Communist actions, while the Vietnamese version uses it for American-backed anti-Communist operations. The two pages present opposing views of the Vietnam War. Imagine a 16-year-old doing a history assignment: how much of their opinion will be shaped by which Wikipedia entry they encounter? Or simply by which language they search Google in?

So Wikipedia, as a crowd-sourced, shared-memory space, can be biased. A big reason is that the contributor pool is mostly “(1) a male, (2) technically inclined, (3) formally educated, (4) an English speaker (native or non-native), (5) aged 15 to 49, (6) from a majority-Christian country, (7) from a developed nation, (8) from the Northern Hemisphere, and (9) likely employed as a white-collar worker or enrolled as a student rather than employed as a laborer.” (Madsen-Brooks, 2013)

It’s scary how influential Wikipedia text is in shaping our first impression of any subject, and yet how fallible it is.

So what do we do about it?

What has been done about it?

Countless researchers have set out to solve this question of improving Wikipedia’s credibility, or web credibility in general. Naturally, they land on Natural Language Processing (NLP) techniques, leveraging the power of Machine Learning (ML).

Most often, an ML model that tells you whether a Wikipedia page is reliable involves three parts: