— Solo project for UCLA's Data Analysis and Regression class (STATS 101A)

Scope: Winter Quarter 2020

Programs Used: R

Data: https://www.kaggle.com/c/fifa2019wages/data

*Code is listed in its entirety at the bottom of this page!

Background


The Project

Our professor turned the final project into a Kaggle competition: using FIFA data on players' attributes and performance, we had to build a linear regression model that predicts each player's annual wage. We fit the model on a set of training data and were then scored (and graded) on the model's R^2 value on a separate set of testing data.
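To make the scoring idea concrete, here is a minimal sketch of computing R^2 on held-out data. It uses made-up toy data, not the FIFA files, and the names toy_train, toy_test, and r_squared are hypothetical. I am assuming the usual definition R^2 = 1 - SSE/SST; the exact formula Kaggle used for the leaderboard may differ.

```r
# Toy data standing in for the real training and testing sets (made-up values).
set.seed(1)
toy_train <- data.frame(x = 1:20)
toy_train$y <- 3 + 2 * toy_train$x + rnorm(20)
toy_test <- data.frame(x = 21:30)
toy_test$y <- 3 + 2 * toy_test$x + rnorm(10)

# Fit on the training data only, then predict the held-out rows.
fit <- lm(y ~ x, data = toy_train)
pred <- predict(fit, newdata = toy_test)

# R^2 = 1 - SSE / SST (assumed definition; hypothetical helper).
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)     # error left over after the model
  sst <- sum((actual - mean(actual))^2)  # total variation in the response
  1 - sse / sst
}

test_r2 <- r_squared(toy_test$y, pred)
print(test_r2)
```

A value near 1 means the model explains most of the variation in the held-out response. Note that this number can come out lower than the R^2 that summary(fit) reports for the training data.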

Challenges

The data our professor provided came straight from FIFA, so it was up to us to clean it: handle missing values, fix the formatting, and standardize some of the values. Then, to actually build the model, we had to find the most valuable predictors and also weight and transform them to get the best possible model.
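As a rough illustration of those three kinds of cleaning (this is not the code used in the project), here is a small sketch on a made-up data frame. The column names, the values, and the helpers height_to_inches and impute_median are all hypothetical, and median imputation and z-scores are just one possible way to handle missing values and standardize.

```r
# Made-up raw data; the real columns and formats may differ.
raw <- data.frame(
  height = c("5'7", "6'2", NA),
  rating = c(80, NA, 70),
  stringsAsFactors = FALSE
)

# Formatting: turn a feet'inches string such as "5'7" into numeric inches.
height_to_inches <- function(h) {
  parts <- strsplit(h, "'", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)  # missing or malformed entry
    as.numeric(p[1]) * 12 + as.numeric(p[2])
  }, numeric(1))
}

# Missing values: replace each NA with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

raw$height_in <- impute_median(height_to_inches(raw$height))
raw$rating <- impute_median(raw$rating)

# Standardizing (one reading of it): center to mean 0 and scale to sd 1.
raw$rating_z <- as.numeric(scale(raw$rating))
print(raw)
```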

Data Preparation


Loading the Data

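# Testing data (no response column); keep an untouched copy
test <- read.csv('FifaNoY.csv')
testinit <- test
# Training data; keep an untouched copy
train <- read.csv('FifaTrainNew.csv')
traininit <- train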

Cleaning the Data

I wrote a function to clean the training data so that I could then apply the same steps to the testing data. It reformats many of the columns into forms that R can recognize and use in a linear regression model.