— Solo project for UCLA's Introduction to Data Science class (COMM 188C)

Scope: 2 weeks, May 18 - May 29

Programs Used: R

*Code is listed in its entirety at the bottom of this page!


The Project

The premise for our final project was to find an existing dataset (or collect one ourselves), devise a question that the dataset could answer, and then analyze the data to find that answer. The assignment was intentionally vague: since the class was geared toward applying data science to the social sciences, our professor was more focused on encouraging our curiosity than on critiquing our technical or analytical skills.

My Topic

After exploring Kaggle for a while, I came across this dataset of over 40,000 YouTube videos that trended between late 2017 and mid-2018. The data, which includes each video’s title, tags, description, channel name, and numerous other features, was scraped directly from YouTube. As an avid YouTube consumer myself, I thought it would be interesting to dive into the process by which videos are placed on the Trending list.

YouTube’s “Trending” list showcases about 50 videos currently popular on the platform, thus serving as a window into the world’s events, sentiments, and trends at any given moment. According to YouTube’s description, its algorithms determine “popularity” by considering a wide range of variables, including view count, virality, and location of viewers. In addition, videos must generally appeal to a wide range of viewers, not spread misinformation, and represent a diversity of content and content creators. In other words, the video with the highest view count per day does not necessarily top the Trending list.

This has generated considerable mystery about how to land a video on the “Trending” list, especially given the list’s power to launch content, and its creators, into the mainstream. To provide insight into this process, I wanted to analyze key characteristics of videos on the Trending list in order to identify factors that are associated with “trending” status.

Data Preparation

Loading the Data

# Read the trending-videos CSV into a data frame
data <- read.csv("USvideos.csv")
# Drop columns 12 through 15 by position
data <- data[, -c(12:15)]
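
Dropping columns by position works, but it silently removes the wrong columns if the file's column order ever changes. As a minimal sketch of one alternative, the same kind of trimming can be done by column name. The toy data frame and its column names below are hypothetical stand-ins chosen for illustration, not the actual contents of USvideos.csv.

```r
# Toy data frame standing in for the real file; the column names are
# hypothetical and used only to illustrate the technique.
toy <- data.frame(
  title = c("Video A", "Video B"),
  views = c(1000, 2500),
  thumbnail_link = c("a.jpg", "b.jpg"),
  comments_disabled = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Columns to discard, identified by name instead of position.
drop_cols <- c("thumbnail_link", "comments_disabled")

# Keep every column whose name is not in drop_cols.
toy_trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Inspect the structure of the trimmed data frame.
str(toy_trimmed)
```

Running names(data) on the real data frame before dropping anything is a quick way to check which columns a positional range such as 12:15 actually covers.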

Cleaning the Data